Marvin Humphrey wrote:
> What I'd like to do is augment my existing patch by making it possible to specify a particular encoding, both for Lucene and Luke.

What ensures that all documents in fact use the same encoding?

The current approach of converting everything to Unicode and then writing UTF-8 to indexes makes indexes portable and simplifies the construction of search user interfaces, since only indexing code needs to know about other character sets and encodings.
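
To make that concrete, here is a minimal sketch of the index-time conversion; the class name and example data are illustrative, not Lucene's API. All knowledge of the source encoding lives in the code that decodes the raw bytes, and everything downstream sees only Unicode:

import java.nio.charset.StandardCharsets;

public class IndexTimeDecoding {
    public static void main(String[] args) {
        // "café" as ISO-8859-1 bytes: 'é' is the single byte 0xE9.
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 };

        // Decode once, at index time, using the collection's known encoding.
        String text = new String(latin1, StandardCharsets.ISO_8859_1);

        // From here on everything is Unicode; the index writer serializes
        // it as UTF-8, so indexes built from differently encoded sources
        // end up byte-for-byte compatible.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(text + " -> " + utf8.length + " UTF-8 bytes"); // café -> 5 UTF-8 bytes
    }
}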

If a collection has invalidly encoded text, how does it help to detect that later rather than sooner?

> Searches will continue to work regardless, because the patched TermBuffer compares raw bytes. (A comparison based on Term.compareTo() would likely fail, because raw bytes translated to UTF-8 may not produce the same ordering.)

UTF-8 has the property that bytewise lexicographic order is the same as Unicode code point order.
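
Both points can be checked with a short program. The compareBytes helper below is an illustrative sketch of a raw-byte term comparison, not the patch's actual TermBuffer code; note that Java bytes are signed, so an unsigned mask is needed to get the byte order described above. The example pair also shows the caveat behind the parenthetical about Term.compareTo(): Java strings compare by UTF-16 code unit, which disagrees with code point order, and hence with UTF-8 byte order, for supplementary characters.

import java.nio.charset.StandardCharsets;

public class Utf8OrderDemo {
    // Unsigned lexicographic byte comparison (illustrative, not the patch's code).
    static int compareBytes(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF); // mask: Java bytes are signed
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String s1 = "\uFF61";                               // U+FF61, UTF-8 bytes EF BD A1
        String s2 = new String(Character.toChars(0x10000)); // U+10000, UTF-8 bytes F0 90 80 80

        byte[] b1 = s1.getBytes(StandardCharsets.UTF_8);
        byte[] b2 = s2.getBytes(StandardCharsets.UTF_8);

        // UTF-8 byte order agrees with code point order: U+FF61 < U+10000.
        System.out.println(compareBytes(b1, b2) < 0); // true
        // But String.compareTo() compares UTF-16 code units, and the high
        // surrogate 0xD800 sorts below 0xFF61, so the two orders disagree.
        System.out.println(s1.compareTo(s2) < 0);     // false
    }
}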

Doug
