Re: Hacking Luke for bytecount-based strings

Marvin Humphrey Wed, 17 May 2006 10:49:57 -0700


On May 16, 2006, at 11:58 PM, Paul Elschot wrote:

Try invoking Luke with a Lucene jar of your choice on the
classpath before Luke itself:


java -cp lucene-core-1.9-rc1-dev.jar:lukeall.jar org.getopt.luke.Luke

I tried this on an index built with KinoSearch 0.05, which pre-dates the addition of term vectors to .fdt. After working out a SecurityException by using individual components rather than lukeall.jar...

Luke powered by the patched library worked; Luke powered by straight-up Lucene did not.

The source material was stuff from Wikipedia, which contains a bunch of invalid UTF-8. KinoSearch doesn't care about that, so it's in there in the index. No problems. :)

What I'd like to do is augment my existing patch by making it possible to specify a particular encoding, both for Lucene and Luke. Searches will continue to work regardless because the patched TermBuffer compares raw bytes. (A comparison based on Term.compareTo() would likely fail because raw bytes translated to UTF-8 may not produce the same results.) That way, say, a Russian user who had built a KinoSearch index using KOI8-R (assuming I revert the .fdt change) could specify KOI8-R and have Luke display the correct characters. Ideally, you'd want to store the index's encoding in the index somewhere, but Lucene doesn't have a place for that, so I need to patch both Luke and Lucene.
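
To illustrate why a byte-wise comparison can succeed where a comparison of decoded strings fails, here is a minimal sketch. It is not the patch itself: the class ByteOrderDemo and its methods are hypothetical names made up for this example, and the two single-byte "terms" are simply bytes that are not valid UTF-8 on their own.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene, Luke, or the patch.
class ByteOrderDemo {

    // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Decode both arrays as UTF-8 first, then compare the resulting Strings.
    // Malformed input is replaced with U+FFFD by this String constructor.
    static int decodedCompare(byte[] a, byte[] b) {
        String sa = new String(a, StandardCharsets.UTF_8);
        String sb = new String(b, StandardCharsets.UTF_8);
        return sa.compareTo(sb);
    }

    public static void main(String[] args) {
        // Two different bytes, neither of which is valid UTF-8 by itself.
        byte[] x = {(byte) 0xE9};
        byte[] y = {(byte) 0xFC};

        // The raw bytes are distinct and ordered.
        System.out.println("bytes differ:   " + (compareBytes(x, y) != 0));

        // After decoding, both collapse to the replacement character,
        // so the two terms can no longer be told apart.
        System.out.println("strings differ: " + (decodedCompare(x, y) != 0));
    }
}
```

Once the invalid bytes have been decoded, the distinction between the two terms is gone, which is the kind of failure a String-based comparison risks on an index containing invalid UTF-8.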

I wonder how Lucene would perform with my patch applied if the indexer were spec'd to use Latin1 rather than UTF-8... patches to the segment merging apparatus would be required...
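
For context on what changes between the two encodings, here is a small sketch showing how the byte count of the same term differs under Latin1 and UTF-8. The class name EncodingLengthDemo and the sample term are assumptions chosen for illustration; nothing here measures Lucene's performance.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene or the patch.
class EncodingLengthDemo {

    // Number of bytes needed to store the term in the given encoding.
    static int byteLength(String term, Charset charset) {
        return term.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute (U+00E9), written as an escape so the
        // source file's own encoding does not matter.
        String term = "caf\u00E9";

        System.out.println("chars:  " + term.length());
        System.out.println("Latin1: " + byteLength(term, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:  " + byteLength(term, StandardCharsets.UTF_8));
    }
}
```

In Latin1 every character in range occupies exactly one byte, so the byte count and the character count coincide; in UTF-8 any character above U+007F takes two or more bytes.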


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

