On May 16, 2006, at 11:58 PM, Paul Elschot wrote:
Try to invoke Luke with a Lucene jar of your choice on the
classpath before Luke itself:

java -cp lucene-core-1.9-rc1-dev.jar:lukeall.jar org.getopt.luke.Luke

I tried this on an index built with KinoSearch 0.05, which pre-dates the addition of term vectors to .fdt. After working out a SecurityException by using individual components rather than lukeall.jar...

Luke powered by the patched library worked; Luke powered by straight-up Lucene did not.

The source material was stuff from Wikipedia, which contains a bunch of invalid UTF-8. KinoSearch doesn't care about that, so it's in there in the index. No problems. :)

What I'd like to do is augment my existing patch by making it possible to specify a particular encoding, both for Lucene and Luke. Searches will continue to work regardless because the patched TermBuffer compares raw bytes. (A comparison based on Term.compareTo() would likely fail because raw bytes translated to UTF-8 may not produce the same results.) That way, say, a Russian user who had built a KinoSearch index using KOI8-R (assuming I revert the .fdt change) could specify KOI8-R and have Luke display the correct characters. Ideally, you'd want to store the index's encoding in the index somewhere, but Lucene doesn't have a place for that, so I need to patch both Luke and Lucene.
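
To illustrate the point about raw bytes, here is a minimal sketch in plain Java. It is not the actual patch: the class RawTermBytes and the methods compareBytes and decodeForDisplay are hypothetical names made up for this example, and nothing here touches Lucene's or Luke's real classes. It shows two distinct byte sequences that are both invalid UTF-8 comparing as different when treated as raw bytes, but as equal once they have been decoded to Java strings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Lucene, Luke, or the patch.
public class RawTermBytes {

    // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Decode raw bytes with a caller-chosen encoding, for display purposes only.
    public static String decodeForDisplay(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0xC0 and 0xC1 never appear in well-formed UTF-8.
        byte[] x = {(byte) 0xC0};
        byte[] y = {(byte) 0xC1};

        // As raw bytes, the two terms are distinct and ordered.
        System.out.println(compareBytes(x, y) < 0);   // prints "true"

        // Decoded as UTF-8, each becomes the replacement character U+FFFD,
        // so a String-based comparison can no longer tell them apart.
        String sx = new String(x, StandardCharsets.UTF_8);
        String sy = new String(y, StandardCharsets.UTF_8);
        System.out.println(sx.compareTo(sy) == 0);    // prints "true"
    }
}
```

A viewer along these lines would keep comparing and looking up terms by their bytes and call something like decodeForDisplay(raw, "KOI8-R") only when rendering, assuming the JVM in use provides that charset.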

I wonder how Lucene would perform with my patch applied if the indexer were spec'd to use Latin1 rather than UTF-8... patches to the segment merging apparatus would be required...

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

