Lukas, I believe what I'm about to tell you is correct... ;]

INDEXING

UTF-8 is an encoding of Unicode characters, as are UTF-16 and UCS-2. Java represents strings in memory as 16-bit UTF-16 code units, which is equivalent to UCS-2 for every character you'll need for Czech. So once the ISO-8859-2 bytes are read in from disk (specifying the proper encoding on the reader), they are Unicode characters in memory, and the encoding of the original file no longer matters.

If you are using HTML character entities in the original content, they need to be converted to their Unicode equivalents after reading in from disk. The different Unicode encodings (UTF-8, UTF-16, UCS-2, etc.) use different lengths and byte orders when characters are written out as bytes, but that doesn't affect the conversion: in memory a Java char is simply a 16-bit value, so all you need is each entity's code point (æ -> 0x00E6, i.e. '\u00E6').

Lucene writes out its index files in a UTF-8-based encoding, so you don't need to worry about this (just know that it supports all the chars you'll want to use).

SEARCHING

When reading the query string, you need to specify the encoding the bytes actually arrive in on the specific platform, e.g. ISO-8859-2 on some systems and Win-1250 on others (otherwise the chars will get mapped incorrectly). As long as they're read into Java correctly, everything should be OK.

This might be a problem if the search is web-based. The browser, server, etc. have to agree on a common encoding, which might not support certain chars you need to search on. You could use character entities for these and convert them like you did when reading the HTML files, but this requires a special user interface to enter the entities.

ANALYZERS

Whatever analyzer you use should work on Unicode chars as well. Unicode literals may be embedded in Java source files as '\u00E6'. You will want to include the HTML character entity conversion in the analyzer, to ensure it's done the same way in both places (indexing/searching).

-doug

> -----Original Message-----
> From: Lukas Zapletal [mailto:[EMAIL PROTECTED]]
> Sent: Friday, January 03, 2003 4:19 AM
> To: [EMAIL PROTECTED]
> Subject: Converting ISO88592 files to UTF8 and indexing`em
>
> Dears,
>
> I have a problem. I need to index Czech content that is in HTML files in
> ISO-8859-2. Is there any way to convert them to UTF and index them?
> What stream or reader have I use? Is it possible?
>
> How can I construct queries after that... Some systems have ISO-8859-2 and
> some systems Win-1250.
> Is there any way to convert query string from default (system) encoding to
> UTF8?
>
> People programming ENGLISH systems are so happy... ;-)
>
> --
> Lukas Zapletal
> http://www.tanecni-olomouc.cz/lzap
> [EMAIL PROTECTED]
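
P.S. In case it helps, here is a rough sketch of the two steps described in the reply above: reading bytes with an explicit encoding, and turning HTML character entities into Unicode chars. Treat it as an illustration under assumptions, not a finished implementation. The class name CzechText, the helpers decode and decodeEntities, and the three-entry entity table are all made up for this example (a real table would cover the full HTML entity set), and nothing here is Lucene API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; none of these names come from Lucene.
final class CzechText {

    // Tiny illustrative entity table (assumption: extend it to the entities you actually use).
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("aelig", "\u00E6");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("amp", "&");
    }

    // Matches decimal (&#230;), hexadecimal (&#xE6;) and named (&aelig;) entities.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    /** Reads all bytes from the stream and decodes them with the named charset. */
    static String decode(InputStream in, String charsetName) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, Charset.forName(charsetName))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Replaces known entities with their Unicode chars; unknown ones are left as-is. */
    static String decodeEntities(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String replacement = m.group(); // default: keep the entity untouched
            if (name.charAt(0) == '#') {
                boolean hex = name.charAt(1) == 'x' || name.charAt(1) == 'X';
                int codePoint = hex
                        ? Integer.parseInt(name.substring(2), 16)
                        : Integer.parseInt(name.substring(1), 10);
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } else if (NAMED.containsKey(name)) {
                replacement = NAMED.get(name);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The same byte means different letters in ISO-8859-2 and Win-1250,
        // which is why the encoding has to be named explicitly.
        byte[] raw = { (byte) 0xB9 };
        String latin2 = decode(new ByteArrayInputStream(raw), "ISO-8859-2");
        String cp1250 = decode(new ByteArrayInputStream(raw), "windows-1250");
        System.out.println(Integer.toHexString(latin2.charAt(0)));
        System.out.println(Integer.toHexString(cp1250.charAt(0)));
        System.out.println(decodeEntities("&aelig; &#230; &#xE6;").length());
    }
}
```

For files on disk you would pass a FileInputStream to decode instead of the byte array, and for a query string you would pass whichever encoding the client really used. Calling decodeEntities from both the indexing path and the query path keeps the conversion identical in both places.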
