On May 16, 2006, at 7:53 AM, David Balmain wrote:

> On 5/16/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>>
>> On May 15, 2006, at 12:08 PM, steven shingler wrote:
>>> Am I right in thinking Ferret should be able to read a Lucene
>>> generated index no problem?
>>
>> That would be nice, but it is not currently the case because of
>> Java's wacky "modified" UTF-8 serialization. I've seen that plain
>> ol' ASCII text indexes will be compatible, but once you put in some
>> higher order characters things go askew.
>
> Hey guys,
>
> What Erik said is exactly correct. Marvin Humphrey (author of
> KinoSearch, a Perl port of Lucene) has submitted a patch to Lucene so
> that non-Java ports of Lucene will be able to read Lucene indexes. It
> currently slows Lucene down by about 25% (I think??)
Around 20% for indexing, according to my benchmarker. I don't have a benchmark for searching.

Modified UTF-8 is not so much the problem for the performance of my patch, nor is it actually causing the index incompatibility in this case. Modified UTF-8 is problematic for a couple of other reasons.

When text contains either null bytes or Unicode code points above the Basic Multilingual Plane (values 2^16 and up, such as U+1D160 "MUSICAL SYMBOL EIGHTH NOTE"), KinoSearch and Ferret, if they write legal UTF-8, would write indexes which would cause Lucene to crash from time to time with a baffling "read past EOF" error. Therefore, to be Lucene-compatible they'd have to pre-scan all text to detect those conditions, which would impose a performance burden and require some crufty auxiliary code to turn the legal UTF-8 into Modified UTF-8.

Also, non-shortest-form UTF-8 presents a theoretical security risk, and Perl is set up to issue a warning whenever a scalar which is marked as UTF-8 isn't shortest-form. That condition would occur whenever Modified UTF-8 containing null bytes or code points above the BMP was read in -- thus requiring that all incoming text be pre-scanned as well.

Those are rare conditions, but it isn't realistic to just say "KinoSearch|Ferret doesn't support null bytes or characters above the BMP", because a lot of the time the source text that goes into an index isn't under the full control of the indexing/search app's author.

To be fair to Java and Lucene, they are paying a price for early commitment to the Unicode standard. Lucene's UTF-8 encoding/decoding hasn't been touched since Doug Cutting wrote it in 1998, when non-shortest-form UTF-8 was still legal and Unicode was still 16-bit. You could argue that the Unicode consortium pulled the rug out from under its early champions by changing the spec so that existing implementations were no longer compliant.

The performance problems of my patch and the crashing are actually tied to the Lucene File Format's definition of a String. A String in Lucene is the length of the string in Java chars, followed by the character data translated to Modified UTF-8. A String in KinoSearch, and if I am not mistaken in Ferret as well, is the length of the character data in bytes, followed by the character data. Those two definitions of String result in identical indexes so long as your text is pure ASCII, but as Erik noted, when you add higher order characters to the mix, problems arise. You end up reading either too few bytes or too many, the stream gets out of sync, and whammo: "Read past EOF".

My patch modifies Lucene to use byte counts as the prefix to its Strings. Unfortunately, there are encoding/decoding inefficiencies associated with the new way of doing things. Under Lucene's current definition of a String, you allocate an array of Java chars and then read characters into it one by one. With the new patch, you don't know how many chars you need, so you might have to re-allocate several times. There are ways to address that inefficiency, but they'd take a while to explain. (I've appended a few quick Java sketches at the end of this message to make these issues concrete.)

> Don't hold your breath though. It's going to take us a while to get
> it in there.

Yeah. Modifying Lucene so that it can read both the old index format and the new one without suffering a performance degradation in either case is going to be non-trivial. I'm sympathetic to the notion that it may not be worth it and that Lucene should declare its file format private. There are a lot of issues in play.

No KinoSearch user has yet complained about Lucene/KinoSearch file-format compatibility.
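First, the encoding difference itself. Here's a minimal, self-contained Java demo (my own sketch, not code from Lucene or the patch) showing how a single code point above the BMP comes out under the two encodings:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws Exception {
            // U+1D160 MUSICAL SYMBOL EIGHTH NOTE: one code point above
            // the BMP, stored in Java as two chars (a surrogate pair).
            String note = "\uD834\uDD60";

            // Legal UTF-8: a single 4-byte sequence, F0 9D 85 A0.
            byte[] legal = note.getBytes("UTF-8");

            // Modified UTF-8, as written by DataOutputStream.writeUTF
            // (after a 2-byte length prefix): each surrogate becomes its
            // own 3-byte sequence, ED A0 B4 ED B5 A0 -- six bytes that
            // are not legal UTF-8.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(note);
            int modifiedLen = buf.size() - 2; // drop the length prefix

            System.out.println("Java chars:           " + note.length()); // 2
            System.out.println("legal UTF-8 bytes:    " + legal.length);  // 4
            System.out.println("Modified UTF-8 bytes: " + modifiedLen);   // 6
        }
    }

Null bytes diverge the same way: legal UTF-8 encodes U+0000 as the single byte 00, while Modified UTF-8 writes the two-byte pair C0 80 -- which is exactly the kind of non-shortest-form sequence that makes Perl complain.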
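Second, the prefix mismatch and the "Read past EOF" failure mode. This sketch (again mine, with Lucene's VInt simplified to a plain int) writes two Strings using the byte-count convention, then reads them back treating the prefix as a char count, the way Lucene does -- and promptly loses its place in the stream:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class DesyncDemo {
        public static void main(String[] args) throws IOException {
            // Writer uses the KinoSearch/Ferret convention: byte count,
            // then legal UTF-8. Two records back to back.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            String[] records = { "hell\uD834\uDD60", "world" };
            for (int i = 0; i < records.length; i++) {
                byte[] utf8 = records[i].getBytes("UTF-8");
                out.writeInt(utf8.length);
                out.write(utf8);
            }

            // Reader wrongly applies the Lucene convention: it treats the
            // prefix as a CHAR count and decodes until it has that many
            // chars. The first record is 6 chars but 8 bytes, so a prefix
            // of 8 makes this reader decode 8 chars -- eating into the
            // second record's length prefix and desyncing the stream.
            // Run it off the end of the data and you get "Read past EOF".
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()));
            int charCount = in.readInt();
            InputStreamReader decoder = new InputStreamReader(in, "UTF-8");
            char[] chars = new char[charCount];
            int got = 0;
            while (got < charCount) {
                int n = decoder.read(chars, got, charCount - got);
                if (n < 0) throw new EOFException("Read past EOF");
                got += n;
            }
            // The "string" now ends with stray NUL chars scavenged from
            // the next record's length prefix.
            System.out.println(new String(chars, 0, got));
        }
    }

With pure ASCII the two prefixes are numerically identical, which is why ASCII-only indexes happen to be compatible today.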
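Finally, the decoding inefficiency my patch introduces. With a char-count prefix, Lucene sizes its char array exactly, once. With a byte-count prefix, the char count isn't known until you've decoded, so you either over-allocate (N bytes can never decode to more than N chars) or start small and grow. A toy illustration of the growth cost (the String round-trip stands in for a real decode loop; this is not the patch's actual strategy):

    import java.io.UnsupportedEncodingException;

    public class ReallocDemo {
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            byte[] utf8 = "a string with a note: \uD834\uDD60"
                    .getBytes("UTF-8");
            // All the new prefix tells us is the byte count:
            int byteCount = utf8.length;

            char[] chars = new char[4]; // initial guess
            int nChars = 0;
            String decoded = new String(utf8, "UTF-8");
            for (int i = 0; i < decoded.length(); i++) {
                if (nChars == chars.length) {
                    // The cost the old char-count prefix never pays:
                    // allocate a bigger array and copy everything over.
                    char[] bigger = new char[chars.length * 2];
                    System.arraycopy(chars, 0, bigger, 0, nChars);
                    chars = bigger;
                    System.out.println("re-allocated to " + chars.length);
                }
                chars[nChars++] = decoded.charAt(i);
            }
            System.out.println(new String(chars, 0, nChars));
        }
    }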
The only thing I miss is Luke -- which is significant, because Luke is really handy. How many users here care about Lucene compatibility, and why?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

