On May 16, 2006, at 11:58 PM, Paul Elschot wrote:
> Try invoking Luke with a Lucene jar of your choice on the
> classpath before Luke itself:
> java -cp lucene-core-1.9-rc1-dev.jar:lukeall.jar org.getopt.luke.Luke
I tried this on an index built with KinoSearch 0.05, which pre-dates
the addition of term vectors to .fdt. After working around a
SecurityException by using individual components rather than
lukeall.jar, Luke powered by the patched library worked; Luke powered
by straight-up Lucene did not.
The source material was stuff from Wikipedia, which contains a bunch
of invalid UTF-8. KinoSearch doesn't care about that, so those bytes
are in the index as-is. No problems. :)
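
A minimal sketch, in plain Java, of why invalid UTF-8 is harmless while it
stays as raw bytes but becomes lossy once it is decoded into a String. The
class and method names (InvalidUtf8Demo, isValidUtf8, survivesRoundTrip) are
hypothetical and are not part of Lucene, Luke, or KinoSearch; the sample
bytes are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of any of the libraries discussed.
final class InvalidUtf8Demo {

    // Strict check: the decoder is told to report malformed input
    // instead of silently substituting a replacement character.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lenient decode followed by re-encode. Malformed bytes are replaced
    // with U+FFFD during decoding, so the original bytes are not recovered.
    static boolean survivesRoundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {'a', (byte) 0xFF, 'b'};
        byte[] good = "abc".getBytes(StandardCharsets.UTF_8);

        System.out.println("bad valid?        " + isValidUtf8(bad));
        System.out.println("bad round-trips?  " + survivesRoundTrip(bad));
        System.out.println("good valid?       " + isValidUtf8(good));
        System.out.println("good round-trips? " + survivesRoundTrip(good));
    }
}
```

The point of the sketch is only that code which keeps and compares the
stored bytes never hits this problem, whereas code that decodes to a String
first cannot get the original bytes back.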
What I'd like to do is augment my existing patch by making it
possible to specify a particular encoding, both for Lucene and Luke.
Searches will continue to work regardless because the patched
TermBuffer compares raw bytes. (A comparison based on
Term.compareTo() would likely fail because raw bytes translated to
UTF-8 may not produce the same results.) That way, say, a Russian
user who had built a KinoSearch index using KOI8-R (assuming I revert
the .fdt change) could specify KOI8-R and have Luke display the
correct characters. Ideally, you'd want to store the index's encoding
in the index somewhere, but Lucene doesn't have a place for that, so
I need to patch both Luke and Lucene.
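
A rough sketch of the two ideas in the paragraph above: comparing terms as
raw unsigned bytes, and decoding them with a user-specified encoding only
for display. This is not the actual patch; RawTermOrder, compareRawBytes,
and decode are hypothetical names, and the two single-byte "terms" are
illustrative KOI8-R values (0xC3 and 0xC4, which decode to U+0446 and
U+0434).

```java
import java.nio.charset.Charset;

// Hypothetical helper; not the patched TermBuffer and not a Lucene API.
final class RawTermOrder {

    // Compare two terms byte by byte, treating each byte as unsigned.
    // A term that is a strict prefix of another sorts first.
    static int compareRawBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Decode raw term bytes for display, using an encoding the user names.
    static String decode(byte[] raw, String encodingName) {
        return new String(raw, Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        byte[] tse = {(byte) 0xC3}; // KOI8-R byte for U+0446
        byte[] de = {(byte) 0xC4};  // KOI8-R byte for U+0434

        String tseText = decode(tse, "KOI8-R");
        String deText = decode(de, "KOI8-R");

        // Raw byte order and String order disagree for this pair.
        System.out.println("raw bytes: " + Integer.signum(compareRawBytes(tse, de)));
        System.out.println("Strings:   " + Integer.signum(tseText.compareTo(deText)));
    }
}
```

For this pair the raw comparison puts 0xC3 before 0xC4, while
String.compareTo puts U+0434 before U+0446, so a term dictionary sorted one
way cannot be searched reliably with the other comparison. Decoding only at
display time keeps the stored order intact.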
I wonder how Lucene would perform with my patch applied if the
indexer were spec'd to use Latin1 rather than UTF-8... patches to
the segment merging apparatus would be required...
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/