Marvin Humphrey wrote:
> What I'd like to do is augment my existing patch by making it possible to specify a particular encoding, both for Lucene and Luke.

What ensures that all documents in fact use the same encoding?

The current approach of converting everything to Unicode and then writing UTF-8 to indexes makes indexes portable and simplifies the construction of search user interfaces, since only indexing code needs to know about other character sets and encodings.
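
To make that concrete, here is a minimal sketch of the index-time conversion; the class name and example data are illustrative, not Lucene's API. All knowledge of the source encoding lives in the code that decodes the raw bytes, and everything downstream sees only Unicode:

import java.nio.charset.StandardCharsets;

public class IndexTimeDecoding {
    public static void main(String[] args) {
        // "café" as ISO-8859-1 bytes: 'é' is the single byte 0xE9.
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 };

        // Decode once, at index time, using the collection's known encoding.
        String text = new String(latin1, StandardCharsets.ISO_8859_1);

        // From here on everything is Unicode; the index writer serializes
        // it as UTF-8, so indexes built from differently encoded sources
        // end up byte-for-byte compatible.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(text + " -> " + utf8.length + " UTF-8 bytes"); // café -> 5 UTF-8 bytes
    }
}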

If a collection has invalidly encoded text, how does it help to detect that later rather than sooner?

> Searches will continue to work regardless, because the patched TermBuffer compares raw bytes. (A comparison based on Term.compareTo() would likely fail, because raw bytes translated to UTF-8 may not produce the same ordering.)

UTF-8 has the property that bytewise lexicographic order is the same as Unicode code point order.
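
Both points can be checked with a short program. The compareBytes helper below is an illustrative sketch of a raw-byte term comparison, not the patch's actual TermBuffer code; note that Java bytes are signed, so an unsigned mask is needed to get the byte order described above. The example pair also shows the caveat behind the parenthetical about Term.compareTo(): Java strings compare by UTF-16 code unit, which disagrees with code point order, and hence with UTF-8 byte order, for supplementary characters.

import java.nio.charset.StandardCharsets;

public class Utf8OrderDemo {
    // Unsigned lexicographic byte comparison (illustrative, not the patch's code).
    static int compareBytes(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF); // mask: Java bytes are signed
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String s1 = "\uFF61";                               // U+FF61, UTF-8 bytes EF BD A1
        String s2 = new String(Character.toChars(0x10000)); // U+10000, UTF-8 bytes F0 90 80 80

        byte[] b1 = s1.getBytes(StandardCharsets.UTF_8);
        byte[] b2 = s2.getBytes(StandardCharsets.UTF_8);

        // UTF-8 byte order agrees with code point order: U+FF61 < U+10000.
        System.out.println(compareBytes(b1, b2) < 0); // true
        // But String.compareTo() compares UTF-16 code units, and the high
        // surrogate 0xD800 sorts below 0xFF61, so the two orders disagree.
        System.out.println(s1.compareTo(s2) < 0);     // false
    }
}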

Doug
