Murray Altheim wrote:
> Stefan Wachter wrote:
>> Hi Daniel,
>> I am using NetBeans 3.6, which certainly is Unicode-aware. Yet NetBeans does not seem to detect automatically that the Lucene source files are UTF-8 encoded. I guess that it uses the platform-specific default encoding, which is ISO-8859-1 on my Linux operating system.
> On Linux you can set the default encoding at the platform level, at the user level, and for individual applications. You're not forced to stay within ISO-8859-1. Think about it this way: if that were the case, how could a multi-user system like Linux support only one encoding per machine? This sounds more like a NetBeans problem than an OS problem. I don't use NetBeans, but there must be a way to indicate an encoding beyond your particular user settings. Otherwise English-speaking programmers couldn't develop non-English programs, which is hard to believe.
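(For illustration, a minimal sketch of the default in play here; the JVM picks its default encoding up from the OS/user locale at startup:)

    public class ShowDefaultEncoding {
        public static void main(String[] args) {
            // The JVM derives this from the OS/user locale (e.g. LANG or
            // LC_CTYPE on Linux) at startup. It can be overridden for a
            // single application with: java -Dfile.encoding=UTF-8 ...
            // (widely used, but not an officially supported switch).
            System.out.println(System.getProperty("file.encoding"));
        }
    }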
>> I can tell the NetBeans IDE the encoding of every single source file. But the problem is that I might not know what the correct encoding is. In the case of Lucene it is quite clear, because it is mentioned in the build.xml file. But what is the situation if someone sends you a stemmer class, for Swahili say, and you do not know in which encoding the author wrote the source? Then you can try lots of encodings until the Java compiler is satisfied, and even then you cannot be sure that you used the right one.
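(One partial remedy, sketched here with the java.nio.charset API: a strict decoder can rule a candidate encoding out, though never confirm it. ISO-8859-1, for example, accepts every byte sequence, so a successful decode proves nothing; a failed UTF-8 decode, on the other hand, is conclusive.)

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class EncodingCheck {
        // Returns true if the bytes form a valid sequence in the given
        // charset. REPORT makes the decoder fail on malformed input
        // instead of silently substituting replacement characters.
        public static boolean decodesAs(byte[] bytes, String charsetName) {
            CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(bytes));
                return true;
            } catch (CharacterCodingException e) {
                return false;
            }
        }
    }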
>> I think what Java lacks is a means to indicate the encoding of source files inside the file itself (e.g. <?java encoding="ISO-8859-1"?>, in an XML-like way). As it stands, the encoding has to be fed into the system from the outside; what else could be the reason for having an encoding switch on the Java compiler? Therefore I think it is best for Java source files to be plain ASCII.
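(That switch, for reference; the file name is only a placeholder:)

    javac -encoding UTF-8 SwahiliStemmer.java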
> Java has quite a lot of localization features built into the language. Yes, the encoding has to be specified, just as one would have to tell any processor how to decode a given set of bytes. Java itself is Unicode-aware for anything dealing with characters; it is only when dealing with byte streams that the encoding has to be specified. Here's a good article on the subject:
> http://www.jorendorff.com/articles/unicode/java.html
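(For illustration: the decoding happens exactly at the stream-to-reader boundary, where the charset must be named. A minimal sketch, with file name and charset chosen arbitrarily:)

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class ReadUtf8File {
        public static void main(String[] args) throws IOException {
            // Bytes become characters only here; naming the charset tells
            // the reader how to decode the stream. Omitting it falls back
            // to the platform default encoding.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
            in.close();
        }
    }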
> As for crippling files by forcing them into plain ASCII, why would we want to step back 20 years in computer science? It's been a long-fought battle to get to where we are now, and the desire of a few people to be able to look at a file in ASCII is far outweighed by the rest of the world, whose languages don't fit into that straitjacket. As was mentioned, it would make the code a great deal harder both to read and to manage.
> I remember looking at a desktop publishing application developed at StoneHand in 1996 that had Arabic, Gujarati, Japanese, Chinese, English, and Hebrew on the screen at the same time, and thinking: damn, pretty impressive! We now have that kind of thing in our browsers and think little of it. I'd hate to step back to pre-1996 again.
> We should all be using Unicode-aware tools. It's what the rest of the world is doing, even in the Anglocentric US. For an international project like Lucene, there's no good reason to step back in time to ASCII. There are many programmers using the Lucene source code who have no problem with Unicode, and it would not be in their interest to suddenly be reading numeric character entities rather than normally readable text.
> Murray
Of course I also like all the Unicode awareness of Java. In fact I wrote a Java-XML data binding, including an XML parser (cf. www.jbind.org), that benefited a lot from this awareness. In XML there is a clearly defined mechanism for determining the file encoding (looking at the first four bytes). In Java, however, there is no such mechanism. If I get some sources from somewhere and want to compile them, then I must know their encoding. And if different sources in a project use different encodings, I have to be careful to call the compiler several times with changing encoding switches.
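(Roughly the detection Stefan refers to, per Appendix F of the XML 1.0 specification; a simplified sketch covering only the common cases:)

    public class XmlEncodingSniffer {
        // Guess the encoding family of an XML document from its first
        // bytes, before the encoding declaration itself can be read.
        public static String guess(byte[] b) {
            if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                    && (b[2] & 0xFF) == 0xBF) return "UTF-8";    // UTF-8 BOM
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
                return "UTF-16BE";                               // BOM FE FF
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
                return "UTF-16LE";                               // BOM FF FE
            if (b.length >= 4 && b[0] == 0x00 && b[1] == 0x3C)
                return "UTF-16BE";                               // '<' as 00 3C
            if (b.length >= 4 && b[0] == 0x3C && b[1] == 0x00)
                return "UTF-16LE";                               // '<' as 3C 00
            return "UTF-8";     // XML's default absent a BOM or declaration
        }
    }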
Therefore it would be great if all Java programmers would agree on the same encoding for source files (be it UTF-8, ISO-8859-1 or something really exotic). This has nothing to do with display - it is just the file encoding. Of course this is not realistic. So why not use plain ASCII encoding, with the addition of the \u escape? Of course, you objected that the sources are then not so readable. But programming books teach me to factor the text parts out of program code anyway.
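(For reference, the two forms below compile to identical string contents, and the JDK ships a native2ascii tool that converts whole files between them, e.g. native2ascii -encoding UTF-8 Foo.java FooAscii.java, the file names being placeholders. And java.util.ResourceBundle is the standard way to factor display text out of the code entirely.)

    public class EscapeDemo {
        public static void main(String[] args) {
            // In a source file compiled with the correct -encoding switch:
            String direct = "français";
            // In a plain-ASCII source file, via a Unicode escape:
            String escaped = "fran\u00e7ais";
            System.out.println(direct.equals(escaped));   // prints: true
        }
    }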
--Stefan
> ......................................................................
> Murray Altheim                    http://kmi.open.ac.uk/people/murray/
> Knowledge Media Institute
> The Open University, Milton Keynes, Bucks, MK7 6AA, UK
> [International Committee of the Red Cross director] Kraehenbuhl pointed out that complying with international humanitarian law was "an obligation, not an option" for all sides of the conflict. "If these rules or any other applicable rules of international humanitarian law are violated, the persons responsible must be held accountable for their actions," he said.
> -- BBC News, http://news.bbc.co.uk/1/hi/world/middle_east/4027163.stm
"In my judgment, this new paradigm [the War on Terror] renders obsolete Geneva's strict limitations on questioning of enemy prisoners and renders quaint some of its provisions [...] Your determination [that the Geneva Conventions] does not apply would create a reasonable basis in law that [the War Crimes Act] does not apply, which would provide a solid defense to any future prosecution." -- Alberto Gonzalez, appointed US Attorney General, and likely Supreme Court nominee, in a memo to George W. Bush http://www.adamhodges.com/WORLD/docs/gonzales_memo.pdf