On May 16, 2006, at 7:53 AM, David Balmain wrote:

> On 5/16/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>>
>> On May 15, 2006, at 12:08 PM, steven shingler wrote:
>>> Am I right in thinking Ferret should be able to read a Lucene
>>> generated
>>> index no problem?
>>
>> That would be nice, but it is not currently the case because of
>> Java's wacky "modified" UTF-8 serialization.  I've seen that plain
>> ol' ASCII text indexes will be compatible, but once you put in some
>> higher order characters things go askew.
>
> Hey guys,
>
> What Erik said is exactly correct. Marvin Humphrey, (author of
> KinoSearch, a Perl port of Lucene) has submitted a patch to Lucene so
> that non-java ports of Lucene will be able to read Lucene indexes. It
> currently slows Lucene down by about 25% at the moment (I think??)

Around 20% for indexing according to my benchmarker.  I don't have a  
benchmark for searching.

Modified UTF-8 is not so much the problem for my patch's performance, nor is it actually what causes the index incompatibility in this case.  Modified UTF-8 is problematic for a couple of other reasons.

When text contains either null bytes or Unicode code points above the Basic Multilingual Plane (values 2^16 and up, such as U+1D160 "MUSICAL SYMBOL EIGHTH NOTE"), KinoSearch and Ferret, if they wrote legal UTF-8, would produce indexes that cause Lucene to crash from time to time with a baffling "read past EOF" error.  Therefore, to be Lucene-compatible they'd have to pre-scan all text to detect those conditions, which would impose a performance burden and require some crufty auxiliary code to turn the legal UTF-8 into Modified UTF-8.
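To make the byte-level difference concrete, here's a small Java sketch -- not Lucene code, it just leans on DataOutputStream.writeUTF, which uses the same Modified UTF-8 scheme -- showing how the two encodings diverge for exactly those characters:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    // Compares legal UTF-8 with Java's Modified UTF-8 for a null byte
    // and for U+1D160 (a code point above the BMP).
    public class ModifiedUtf8Demo {
        static String hex(byte[] b) {
            StringBuilder sb = new StringBuilder();
            for (byte x : b) sb.append(String.format("%02X ", x));
            return sb.toString().trim();
        }

        public static void main(String[] args) throws Exception {
            String s = "\u0000" + new String(Character.toChars(0x1D160));

            // Legal UTF-8: 00 F0 9D 85 A0  (1 byte + 4 bytes)
            System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));

            // Modified UTF-8 (skipping writeUTF's 2-byte length prefix):
            // C0 80 ED A0 B4 ED B5 A0  (2 bytes + two 3-byte surrogates)
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] m = bos.toByteArray();
            System.out.println(hex(Arrays.copyOfRange(m, 2, m.length)));
        }
    }

The null becomes a two-byte sequence and the supplementary character becomes two three-byte surrogate encodings, which is exactly the text a Lucene-compatible writer would have to produce.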

Also, non-shortest-form UTF-8 presents a theoretical security risk, and Perl is set up to issue a warning whenever a scalar marked as UTF-8 isn't in shortest form.  That condition would arise whenever Modified UTF-8 containing null bytes or code points above the BMP was read in -- thus requiring that all incoming text be pre-scanned as well.
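For anyone unfamiliar with the security angle: an overlong sequence like C0 80 decodes to U+0000 under a lax decoder, so a scanner looking only for a literal 0x00 byte never sees the null.  A strict decoder rejects it outright.  Here's a rough Java analogue of the check Perl's warning performs, purely as illustration:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    // A modern JDK's default UTF-8 decoder (via newDecoder()) reports
    // malformed input, so the overlong two-byte encoding of NUL is
    // rejected rather than silently decoded to U+0000.
    public class OverlongDemo {
        public static void main(String[] args) throws Exception {
            byte[] overlongNul = { (byte) 0xC0, (byte) 0x80 };
            StandardCharsets.UTF_8.newDecoder()
                .decode(ByteBuffer.wrap(overlongNul)); // throws MalformedInputException
        }
    }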

Those are rare conditions, but it isn't realistic to just say "KinoSearch|Ferret doesn't support null bytes or characters above the BMP", because the source text that goes into an index often isn't under the full control of the indexing/search app's author.

To be fair to Java and Lucene, they are paying a price for early  
commitment to the Unicode standard.  Lucene's UTF-8 encoding/decoding  
hasn't been touched since Doug Cutting wrote it in 1998, when non- 
shortest-form UTF-8 was still legal and Unicode was still 16-bit.   
You could argue that the Unicode consortium pulled the rug out from  
under its early champions by changing the spec so that existing  
implementations were no longer compliant.

The performance problems of my patch and the crashing are actually tied to the Lucene File Format's definition of a String.  A String in Lucene is the length of the string in Java chars, followed by the character data translated to Modified UTF-8.  A String in KinoSearch, and if I am not mistaken in Ferret as well, is the length of the character data in bytes, followed by the character data.

Those two definitions of String result in identical indexes so long  
as your text is pure ASCII, but as Erik noted, when you add higher  
order characters to the mix, problems arise.  You end up reading  
either too few bytes or too many, the stream gets out of sync, and  
whammo: 'Read past EOF'.
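A tiny worked example, if it helps (hypothetical class name, just mirroring the two definitions above):

    import java.nio.charset.StandardCharsets;

    // The same string, measured the two ways the two formats measure it.
    // "caf\u00e9" is 4 Java chars but 5 UTF-8 bytes, so a byte-count
    // reader handed a char-count prefix of 4 stops one byte short; every
    // subsequent read is shifted, and sooner or later a bogus length
    // prefix sends the reader past the end of the file.
    public class PrefixMismatchDemo {
        public static void main(String[] args) {
            String s = "caf\u00e9";
            System.out.println("char-count prefix: " + s.length());                                // 4
            System.out.println("byte-count prefix: " + s.getBytes(StandardCharsets.UTF_8).length); // 5
        }
    }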

My patch modifies Lucene to use byte counts as the prefix to its Strings.  Unfortunately, there are encoding/decoding inefficiencies associated with the new way of doing things.  Under Lucene's current definition of a String, you allocate an array of Java chars and then read characters into it one by one.  With the new patch, you don't know how many chars you need, so you might have to re-allocate several times.  There are ways to address that inefficiency, but they'd take a while to explain.
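Roughly, the two read paths look like this (a plain java.io sketch, not Lucene's actual IndexInput code; the method names are stand-ins):

    import java.io.ByteArrayInputStream;
    import java.io.DataInput;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class StringReadSketch {
        // Old format: the prefix is a char count, so one exact allocation,
        // then a per-char decode loop (collapsed to single-byte chars here
        // for brevity; the real loop decodes Modified UTF-8 sequences).
        static String readCharCountPrefixed(DataInput in) throws IOException {
            int charCount = in.readInt();
            char[] chars = new char[charCount];        // size known up front
            for (int i = 0; i < charCount; i++) {
                chars[i] = (char) in.readUnsignedByte();
            }
            return new String(chars);
        }

        // New format: the prefix is a byte count; how many chars those
        // bytes decode to isn't known until after decoding, so a decoder
        // either over-allocates or grows its buffer as it goes (the JDK
        // String constructor hides that re-allocation here).
        static String readByteCountPrefixed(DataInput in) throws IOException {
            int byteCount = in.readInt();
            byte[] bytes = new byte[byteCount];
            in.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }

        public static void main(String[] args) throws IOException {
            byte[] utf8 = "hello".getBytes(StandardCharsets.UTF_8);
            byte[] framed = new byte[4 + utf8.length];
            framed[3] = (byte) utf8.length;            // big-endian int prefix = 5
            System.arraycopy(utf8, 0, framed, 4, utf8.length);
            DataInput in = new DataInputStream(new ByteArrayInputStream(framed));
            System.out.println(readByteCountPrefixed(in));
        }
    }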

> Don't hold your
> breath though. It's going to take us a while to get it in there.

Yeah.  Modifying Lucene so that it can read both the old index format  
and the new without suffering a performance degradation in either  
case is going to be non-trivial.  I'm sympathetic to the notion that  
it may not be worth it and that Lucene should declare its file format  
private.  There are a lot of issues in play.

No KinoSearch user has yet complained about Lucene/KinoSearch file- 
format compatibility.  The only thing I miss is Luke -- which is  
significant, because Luke is really handy.

How many users here care about Lucene compatibility, and why?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
