Murray Altheim wrote:
> Stefan Wachter wrote:
>> Hi Daniel,
>> I am using NetBeans 3.6, which certainly is Unicode-aware. Yet NetBeans does not seem to detect automatically that the Lucene source files are UTF-8 encoded. I guess that it uses the platform-specific default encoding, which is ISO-8859-1 on my Linux operating system.
> On Linux you can set the default encoding at the platform level, at the user level, and for individual applications. You're not forced to stay within ISO-8859-1. Think about it this way: if that were the case, how could a multi-user system like Linux support only one encoding per machine? This sounds more like a NetBeans problem than an OS problem. I don't use NetBeans, but there must be a way to indicate an encoding beyond your particular user settings. Otherwise English-speaking programmers couldn't develop non-English programs, which is hard to believe.
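(For illustration, a minimal sketch of the default in play here; the JVM picks its default encoding up from the OS/user locale at startup:)

    public class ShowDefaultEncoding {
        public static void main(String[] args) {
            // The JVM derives this from the OS/user locale (e.g. LANG or
            // LC_CTYPE on Linux) at startup. It can be overridden for a
            // single application with: java -Dfile.encoding=UTF-8 ...
            // (widely used, but not an officially supported switch).
            System.out.println(System.getProperty("file.encoding"));
        }
    }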
>> I can tell the NetBeans IDE the encoding of every single source file. But the problem is that I might not know what the correct encoding is. In the case of Lucene it is quite clear, because it is mentioned in the build.xml file. But what is the situation if someone sends you a stemmer class, for Swahili say, and you do not know in which encoding the author wrote the source? Then you can try lots of encodings until the Java compiler is satisfied, and even then you cannot be sure that you used the right one.
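(One partial remedy, sketched here with the java.nio.charset API: a strict decoder can rule a candidate encoding out, though never confirm it. ISO-8859-1, for example, accepts every byte sequence, so a successful decode proves nothing; a failed UTF-8 decode, on the other hand, is conclusive.)

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class EncodingCheck {
        // Returns true if the bytes form a valid sequence in the given
        // charset. REPORT makes the decoder fail on malformed input
        // instead of silently substituting replacement characters.
        public static boolean decodesAs(byte[] bytes, String charsetName) {
            CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(bytes));
                return true;
            } catch (CharacterCodingException e) {
                return false;
            }
        }
    }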
>> I think what Java lacks is a means to indicate the encoding of source files inside the file itself (e.g. <?java encoding="ISO-8859-1"?>, in an XML-like way). As it stands, the encoding has to be fed into the system from the outside; what else could be the reason for having an encoding switch on the Java compiler? Therefore I think it is best for Java source files to be plain ASCII.
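(That switch, for reference; the file name is only a placeholder:)

    javac -encoding UTF-8 SwahiliStemmer.java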
> Java has quite a lot of localization features built into the language. Yes, the encoding has to be specified, just as one would have to tell any processor how to decode a given set of bytes. Java itself is Unicode-aware for anything dealing with characters; it is only when dealing with byte streams that the encoding has to be specified. Here's a good article on the subject:
> http://www.jorendorff.com/articles/unicode/java.html
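(For illustration: the decoding happens exactly at the stream-to-reader boundary, where the charset must be named. A minimal sketch, with file name and charset chosen arbitrarily:)

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class ReadUtf8File {
        public static void main(String[] args) throws IOException {
            // Bytes become characters only here; naming the charset tells
            // the reader how to decode the stream. Omitting it falls back
            // to the platform default encoding.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
            in.close();
        }
    }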
> As for crippling files by forcing them into plain ASCII, why would we want to step back 20 years in computer science? It's been a long-fought battle to get to where we are now, and the desire of a few people to be able to look at a file in ASCII is far outweighed by the rest of the world, whose languages don't fit into that straitjacket. As was mentioned, it would make the code a great deal harder both to read and to manage.
> I remember looking at a desktop publishing application developed at StoneHand in 1996 that had Arabic, Gujarati, Japanese, Chinese, English, and Hebrew on the screen at the same time, and thinking: damn, pretty impressive! We now have that kind of thing in our browsers and think little of it. I'd hate to step back to pre-1996 again.
> We should all be using Unicode-aware tools. It's what the rest of the world is doing, even in the Anglocentric US. For an international project like Lucene, there's no good reason to step back in time to ASCII. There are many programmers using the Lucene source code who have no problem with Unicode, and it would not be in their interest to suddenly be reading numeric character entities rather than normally readable text.
> Murray
Of course I also like all the Unicode awareness of Java. In fact I wrote a Java-XML data binding, including an XML parser (cf. www.jbind.org), that benefited a lot from this awareness. In XML there is a clearly defined mechanism for determining the file encoding (looking at the first four bytes). In Java, however, there is no such mechanism. If I get some sources from somewhere and want to compile them, then I must know their encoding. And if different sources in a project use different encodings, I have to be careful to call the compiler several times with changing encoding switches.
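(Roughly the detection Stefan refers to, per Appendix F of the XML 1.0 specification; a simplified sketch covering only the common cases:)

    public class XmlEncodingSniffer {
        // Guess the encoding family of an XML document from its first
        // bytes, before the encoding declaration itself can be read.
        public static String guess(byte[] b) {
            if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                    && (b[2] & 0xFF) == 0xBF) return "UTF-8";    // UTF-8 BOM
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
                return "UTF-16BE";                               // BOM FE FF
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
                return "UTF-16LE";                               // BOM FF FE
            if (b.length >= 4 && b[0] == 0x00 && b[1] == 0x3C)
                return "UTF-16BE";                               // '<' as 00 3C
            if (b.length >= 4 && b[0] == 0x3C && b[1] == 0x00)
                return "UTF-16LE";                               // '<' as 3C 00
            return "UTF-8";     // XML's default absent a BOM or declaration
        }
    }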
Therefore it would be great if all Java programmers would agree on the same encoding for source files (be it UTF-8, ISO-8859-1 or something really exotic). This has nothing to do with display - it is just the file encoding. Of course this is not realistic. So why not use plain ASCII encoding, with the addition of the \u escape? Of course, you objected that the sources are then not so readable. But programming books teach me to factor the text parts out of program code anyway.
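(For reference, the two forms below compile to identical string contents, and the JDK ships a native2ascii tool that converts whole files between them, e.g. native2ascii -encoding UTF-8 Foo.java FooAscii.java, the file names being placeholders. And java.util.ResourceBundle is the standard way to factor display text out of the code entirely.)

    public class EscapeDemo {
        public static void main(String[] args) {
            // In a source file compiled with the correct -encoding switch:
            String direct = "français";
            // In a plain-ASCII source file, via a Unicode escape:
            String escaped = "fran\u00e7ais";
            System.out.println(direct.equals(escaped));   // prints: true
        }
    }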
--Stefan
> ......................................................................
> Murray Altheim                    http://kmi.open.ac.uk/people/murray/
> Knowledge Media Institute
> The Open University, Milton Keynes, Bucks, MK7 6AA, UK
> [International Committee of the Red Cross director] Kraehenbuhl pointed out that complying with international humanitarian law was "an obligation, not an option" for all sides of the conflict. "If these rules or any other applicable rules of international humanitarian law are violated, the persons responsible must be held accountable for their actions," he said.
> -- BBC News, http://news.bbc.co.uk/1/hi/world/middle_east/4027163.stm
"In my judgment, this new paradigm [the War on Terror] renders obsolete Geneva's strict limitations on questioning of enemy prisoners and renders quaint some of its provisions [...] Your determination [that the Geneva Conventions] does not apply would create a reasonable basis in law that [the War Crimes Act] does not apply, which would provide a solid defense to any future prosecution." -- Alberto Gonzalez, appointed US Attorney General, and likely Supreme Court nominee, in a memo to George W. Bush http://www.adamhodges.com/WORLD/docs/gonzales_memo.pdf