Marvin Humphrey wrote:
> What I'd like to do is augment my existing patch by making it possible
> to specify a particular encoding, both for Lucene and Luke.
What ensures that all documents in fact use the same encoding?
The current approach of converting everything to Unicode and then
writing UTF-8 to indexes makes indexes portable and simplifies the
construction of search user interfaces, since only indexing code needs
to know about other character sets and encodings.
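
To make that concrete, here is a minimal sketch of what "only indexing
code knows about encodings" looks like (the class and method names are
mine, not from any patch): the charset is chosen once, when the Reader
is built, and everything downstream of indexing sees ordinary Unicode
Strings.

    import java.io.*;
    import java.nio.charset.Charset;

    public class LegacyFileReader {
        // Decode a legacy-encoded file into a Java String (Unicode) at
        // index time. The rest of the indexing and search code never
        // needs to know what the source encoding was.
        public static String readAs(File file, String encoding)
                throws IOException {
            Reader reader = new InputStreamReader(
                new FileInputStream(file), Charset.forName(encoding));
            try {
                StringBuilder text = new StringBuilder();
                char[] buf = new char[4096];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    text.append(buf, 0, n);
                }
                return text.toString();
            } finally {
                reader.close();
            }
        }
    }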
If a collection has invalidly encoded text, how does it help to detect
that later rather than sooner?
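
(For illustration, catching it sooner is straightforward with
java.nio.charset: decode strictly at index time and an invalidly
encoded document is rejected before it can pollute the index. A
sketch; the class and method names are mine.)

    import java.nio.ByteBuffer;
    import java.nio.charset.*;

    public class StrictDecoder {
        // REPORT makes the decoder throw CharacterCodingException on
        // malformed or unmappable input instead of silently replacing
        // it with substitution characters.
        public static String decodeStrict(byte[] raw, String encoding)
                throws CharacterCodingException {
            CharsetDecoder decoder = Charset.forName(encoding).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        }
    }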
> Searches will continue to work regardless because the patched
> TermBuffer compares raw bytes. (A comparison based on Term.compareTo()
> would likely fail because raw bytes translated to UTF-8 may not
> produce the same results.)
UTF-8 has the property that bytewise lexicographic order is the same as
Unicode character order.
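
Here is a quick self-contained demonstration of that property (the
sample strings and helper methods are mine): comparing the raw UTF-8
bytes agrees with comparing Unicode code points, including for
supplementary characters.

    public class Utf8OrderDemo {
        // Lexicographic comparison of byte arrays, bytes as unsigned.
        static int compareBytes(byte[] a, byte[] b) {
            int len = Math.min(a.length, b.length);
            for (int i = 0; i < len; i++) {
                int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
                if (diff != 0) return diff;
            }
            return a.length - b.length;
        }

        // Comparison by Unicode code point (not UTF-16 code unit).
        static int compareCodePoints(String s, String t) {
            int i = 0, j = 0;
            while (i < s.length() && j < t.length()) {
                int cs = s.codePointAt(i);
                int ct = t.codePointAt(j);
                if (cs != ct) return cs - ct;
                i += Character.charCount(cs);
                j += Character.charCount(ct);
            }
            return (s.length() - i) - (t.length() - j);
        }

        static int sign(int x) { return x < 0 ? -1 : (x > 0 ? 1 : 0); }

        public static void main(String[] args) throws Exception {
            // U+00E9 and U+FB01 are in the BMP; U+1D400 is supplementary.
            String[] samples = { "abc", "ab\u00e9", "\ufb01sh",
                                 new String(Character.toChars(0x1D400)) };
            for (int i = 0; i < samples.length; i++) {
                for (int j = 0; j < samples.length; j++) {
                    int byByte = sign(compareBytes(
                        samples[i].getBytes("UTF-8"),
                        samples[j].getBytes("UTF-8")));
                    int byCp = sign(compareCodePoints(samples[i], samples[j]));
                    if (byByte != byCp) {
                        throw new AssertionError("order mismatch");
                    }
                }
            }
            System.out.println("UTF-8 byte order matched code point order"
                               + " for every pair.");
        }
    }

(Note that String.compareTo() compares UTF-16 code units, which can
disagree with code point order for supplementary characters, so the
concern about Term.compareTo() above involves a different ordering
than the one UTF-8 bytes preserve.)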
Doug