Actually, in NetBeans 4.0, although you can tell the editor the encoding for each file individually, that doesn't solve your problem. At least for standard projects, you can only provide the required compiler switch (-encoding UTF-8) at the project level. I ran into this same issue building Lucene and had to set the compiler encoding to UTF-8 uniformly for the entire Lucene source tree, which fortunately works.
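Before applying one project-wide -encoding UTF-8 switch, it can help to confirm that every source file really decodes as UTF-8. Here is a minimal sketch of such a check, assuming a recent JDK (java.nio.file did not exist in 2004). The class name Utf8Check and the method isValidUtf8 are made up for illustration and are not part of Lucene or NetBeans.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: reports whether files decode cleanly as UTF-8.
final class Utf8Check {

    /** Returns true if the bytes contain no malformed UTF-8 sequences. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Usage: java Utf8Check.java File1.java File2.java ... */
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(name));
            String verdict = isValidUtf8(bytes) ? "decodes as UTF-8" : "NOT valid UTF-8";
            System.out.println(name + ": " + verdict);
        }
    }
}
```

Note the limits: a file that passes is only consistent with UTF-8, and pure ASCII passes trivially. A file that fails, such as an ISO-8859-1 source containing umlauts, would need converting or a separate project before the uniform switch is safe.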
There are a number of ways to work around this. For example, one could put a non-UTF-8 analyzer into a separate, dependent project.

These are of course NetBeans issues and not Lucene issues. As for Lucene, if this isn't automated, there should at least be a README entry or something similar that identifies the necessary file encodings (it could simply say that UTF-8 is the required encoding to compile Lucene).

When I first tried to build Lucene, because of the default ISO-8859-1 encoding, I got errors about illegal characters. I didn't know what encoding was required, or even for sure that my problem was an encoding issue, so I asked a question on this list. It would be better if this were more apparent.

Chuck

> -----Original Message-----
> From: Murray Altheim [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 26, 2004 2:29 PM
> To: Lucene Developers List
> Subject: Re: encoding of german analyzer source files
>
> Andi Vajda wrote:
> >>I can tell the NetBeans-IDE the encoding of every single source
> >>file. But the problem is that I might not know which the correct
> >>encoding is. In case of Lucene it is quite clear because it is
> >>mentioned in the build.xml file. But what is the situation if
> >>someone sends you a stemmer class for example for Swahili and you
> >>do not know in which encoding the author wrote the source. Then
> >>you can try lots of encodings until the java compiler will be
> >>satisfied with it. And even then you might not be sure that you
> >>used the right encoding.
> >
> >>Therefore it would be great if all Java programmers would agree on
> >>the same encoding of source files (let it be UTF-8, ISO-8859-1 or
> >>something really
> >
> > Actually, the reason for the change to utf-8 was that for Lucene to
> > compile on Windows with gcj (mingw), the encoding better be utf-8
> > because of the typical absence of iconv facility there. Therefore,
> > it would be safe to assume the swahili stemmer source to also be
> > encoded in utf-8.
> >
> > Andi..
>
> Andi,
>
> It may seem pretty safe to assume from practice, but from the Java
> programmer's point of view, it's still not. It's perfectly possible
> that the Swahili file be in UTF-8 or UTF-16, little-endian or
> big-endian, or perhaps some other encoding we don't even know about.
> A minor point I was trying to make is that absent some external
> mechanism, there's really *no way* to know the encoding of a file.
> You can sniff the first few bytes (which is what is recommended
> in the XML 1.0 spec, you can see how they do it there), but making
> such an assumption may lead to program failure if the assumption
> is incorrect.
>
>    Extensible Markup Language (XML) 1.0 (Third Edition)
>    Appendix F Autodetection of Character Encodings
>    http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing
>
> The suggestions there are pretty usable for files that have nothing
> to do with XML.
>
> I don't know how many people on this list are familiar with
> O'Reilly's "CJKV Information Processing" (with the puffer fish on
> the cover), which opened up my eyes to a new world. After reading
> it I got a terrible fright and couldn't sleep for weeks.
>
>    "CJKV Information Processing: Chinese, Japanese, Korean
>    & Vietnamese Computing", by Ken Lunde, O'Reilly Publishing.
>    http://www.oreilly.com/catalog/cjkvinfo/index.html
>    http://www.amazon.com/exec/obidos/tg/detail/-/1565922247/002-2766986-0676059?v=glance&vi=reviews
>
> Murray
>
> ......................................................................
> Murray Altheim                    http://kmi.open.ac.uk/people/murray/
> Knowledge Media Institute
> The Open University, Milton Keynes, Bucks, MK7 6AA, UK               .
>
>    [International Committee of the Red Cross director] Kraehenbuhl
>    pointed out that complying with international humanitarian law
>    was "an obligation, not an option", for all sides of the conflict.
>    "If these rules or any other applicable rules of international
>    humanitarian law are violated, the persons responsible must be
>    held accountable for their actions," he said. -- BBC News
>    http://news.bbc.co.uk/1/hi/world/middle_east/4027163.stm
>
>    "In my judgment, this new paradigm [the War on Terror] renders
>    obsolete Geneva's strict limitations on questioning of enemy
>    prisoners and renders quaint some of its provisions [...]
>    Your determination [that the Geneva Conventions] does not apply
>    would create a reasonable basis in law that [the War Crimes Act]
>    does not apply, which would provide a solid defense to any future
>    prosecution." -- Alberto Gonzalez, appointed US Attorney General,
>    and likely Supreme Court nominee, in a memo to George W. Bush
>    http://www.adamhodges.com/WORLD/docs/gonzales_memo.pdf
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
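
For what it's worth, the byte sniffing Murray mentions in the quoted message can be sketched roughly as follows. This covers only the byte-order-mark cases from Appendix F of the XML 1.0 spec, not the cases that rely on an XML declaration. BomSniffer and sniff are hypothetical names chosen for illustration, and as Murray points out, the result is a guess and not a guarantee.

```java
import java.util.Optional;

// Hypothetical helper: guesses an encoding from a leading byte order mark.
final class BomSniffer {

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns a charset name if the first bytes form a known BOM, else empty.
     * The 4-byte UTF-32 marks are tested first because FF FE 00 00 also begins
     * with the UTF-16LE mark FF FE; that case stays ambiguous in principle
     * (it could be UTF-16LE text starting with U+0000).
     */
    static Optional<String> sniff(byte[] head) {
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(head, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(head, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty(); // no BOM: the encoding is simply unknown
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(sniff(utf8WithBom).orElse("unknown")); // prints "UTF-8"
    }
}
```

An empty result is the common case for Java source files, which usually carry no BOM, so this only narrows things down when a mark happens to be present.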