RE: Converting ISO88592 files to UTF8 and indexing`em

Sale, Doug Fri, 03 Jan 2003 08:07:04 -0800

lukas,

I believe what I'm about to tell you is correct... ;]


INDEXING:

UTF-8 is an encoding of Unicode chars, as is UCS-2.  Java uses UCS-2
internally.  So once the ISO-8859-2 bytes are read in from disk with the
proper encoding specified, they are Unicode chars (UCS-2) in memory.  If the
original content uses HTML character entities, those need to be converted to
their Unicode equivalents after reading in from disk.
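
As a rough sketch of the reading step (the class and method names here are
made up for illustration, they are not part of Lucene or of anything in the
original setup), the encoding can be named explicitly when wrapping the byte
stream in a Reader:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper: decodes ISO-8859-2 bytes into Java chars.
final class Latin2Reader {
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    // Wrap a byte stream so that every read returns Unicode chars.
    static Reader open(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, LATIN2));
    }

    // Read the whole stream into a String; the stream is closed afterwards.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader r = open(in)) {
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Something like Latin2Reader.readAll(new FileInputStream("page.html")) would
then give a String that can be handed on for indexing (the file name is only
a placeholder).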

The different Unicode encodings (UTF-8, UTF-16, UCS-2, etc.) use different
lengths and byte orders for their character encodings.  That only matters
when chars are written out as bytes; in memory a Java char is a single
16-bit value, so an entity converts straight to its code point
(&aelig; -> 0x00E6).
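
A minimal sketch of the entity conversion, assuming a tiny hand-made table of
named entities (a real HTML page needs the full HTML entity list) plus the
numeric &#NNN; and &#xHH; forms.  EntityDecoder is a made-up name, not an
existing library class:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces HTML character entities with Unicode chars.
final class EntityDecoder {
    // Deliberately incomplete table, for illustration only.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("aelig", "\u00E6");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    // Matches hex numeric, decimal numeric, and named entities.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // unknown entity: leave it untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the entity is not recognised.
    private static String resolve(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        try {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int cp = hex ? Integer.parseInt(body.substring(2), 16)
                         : Integer.parseInt(body.substring(1), 10);
            return Character.isValidCodePoint(cp)
                ? new String(Character.toChars(cp)) : null;
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
    }
}
```

The text is scanned once, so "&amp;aelig;" comes out as the literal text
"&aelig;" and is not decoded a second time.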

Lucene writes out its index files in UTF-8 - you don't need to worry about
this.  (Just know that it supports all the chars you'll want to use.)

SEARCHING:

When reading the query string, you need to specify the correct encoding for
the specific platform (otherwise the chars will get mapped incorrectly).  As
long as they're read into Java correctly, everything should be ok.  This
might be a problem if the search is web-based.  The browser, server, etc.
have to agree on a common encoding, which might not support certain chars
you need to search on.  You could use char entities for these and convert
them like you did when reading the HTML files, but this requires a special
user interface to enter the entities.
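
To illustrate the point about reading the query in correctly, here is a small
sketch that decodes the same query from two different platform encodings.
QueryDecoder and its methods are hypothetical names; how the raw bytes or the
URL parameter actually reach your code depends on your setup:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;

// Hypothetical helper: turns an encoded query into a Java String.
final class QueryDecoder {
    // Decode raw bytes that arrived in a known platform encoding,
    // for example "ISO-8859-2" or "windows-1250".
    static String fromBytes(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Decode a percent-encoded URL parameter whose bytes are in the
    // given encoding (whatever the browser and server agreed on).
    static String fromUrlParam(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(
                "Unsupported encoding: " + charsetName, e);
        }
    }
}
```

The byte 0xB9 in ISO-8859-2 and the byte 0x9A in Win-1250 both stand for the
same letter (s with caron), and both decode to the same Java char, so the
query matches the indexed text no matter which system it was typed on.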

ANALYZERS:

Whatever analyzer you use should work on Unicode chars as well.  Unicode
literals may be embedded in Java source files as '\u00E6'.  You will want to
include the HTML character entity conversion in the analyzer, to ensure it's
done the same way in both places (indexing/searching).

-doug



> -----Original Message-----
> From: Lukas Zapletal [mailto:[EMAIL PROTECTED]]
> Sent: Friday, January 03, 2003 4:19 AM
> To: [EMAIL PROTECTED]
> Subject: Converting ISO88592 files to UTF8 and indexing`em
> 
> 
> Dears,
> 
> I have a problem. I need to index Czech content that is in HTML files in
> ISO-8859-2. Is there any way to convert them to UTF and index them?
> What stream or reader should I use? Is it possible?
> 
> How can I construct queries after that... Some systems have ISO-8859-2 and
> some systems Win-1250.
> Is there any way to convert query string from default (system) encoding to
> UTF8?
> 
> People programming ENGLISH systems are so happy... ;-)
> 
> -- 
> Lukas Zapletal
> http://www.tanecni-olomouc.cz/lzap
> [EMAIL PROTECTED]
> 
