On Jun 4, 2006, at 10:46 PM, David Balmain wrote:

> What I
> meant was that Lucy would be striving to maintain "index file format"
> compatibility (which I believe was the plan).

It's funny that we haven't actually settled that.  I used to think  
index compatibility was really important, but I don't so much any more.

Index compatibility is DOA unless Lucene adopts bytecounts as string  
headers, because it would be insanity for Lucy to deal with the  
current format.  So we're talking compatibility no sooner than Lucene  
2.1, and adapting Lucene will be a challenge.  I think the only way  
to make up the speed lost to the bytecounts patch is to pry in the  
KinoSearch merge model.  I strongly suspect that it will prove to be  
a marked improvement over not just the patched version, but the  
current release.
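
To illustrate why the header choice matters, here is a minimal Python
sketch, not Lucene's or KinoSearch's actual code. It assumes a one-byte
length header and standard UTF-8 (as I understand it, Lucene's real
format uses a variable-length integer and a modified UTF-8), and every
function name is hypothetical. With a byte count, a reader can jump past
a string in one step; with a character count, it has to inspect each
lead byte to find where the string ends.

```python
def write_bytecount(s: str) -> bytes:
    """Header holds the number of encoded bytes (assumes < 256)."""
    data = s.encode("utf-8")
    return bytes([len(data)]) + data


def write_charcount(s: str) -> bytes:
    """Header holds the number of characters (assumes < 256)."""
    return bytes([len(s)]) + s.encode("utf-8")


def skip_bytecount(buf: bytes, pos: int) -> int:
    """Return the offset just past the string starting at pos: one addition."""
    return pos + 1 + buf[pos]


def skip_charcount(buf: bytes, pos: int) -> int:
    """Return the offset just past the string starting at pos.

    The header does not say how many bytes follow, so every character's
    lead byte must be examined to learn its encoded width.
    """
    remaining = buf[pos]
    pos += 1
    while remaining:
        lead = buf[pos]
        if lead < 0x80:
            width = 1
        elif lead < 0xE0:
            width = 2
        elif lead < 0xF0:
            width = 3
        else:
            width = 4
        pos += width
        remaining -= 1
    return pos
```

For a string such as "naïve" (5 characters, 6 bytes in UTF-8) the two
headers already differ, and only the byte-count header tells a reader in
C or Perl how much to copy without decoding.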

However... It's a lot of work, and I think I'm the only obvious  
candidate with both the expertise and (maybe) the desire to do it,  
unless you want to take it on.  Two stages out of four are complete.   
The bytecounts patch was stage 1, and last night I supplied stage 2:  
a Java port of KinoSearch's external sorting module.  Stage 3 is  
adapting Lucene's indexing apparatus to write indexes by the segment  
rather than the document -- porting KinoSearch's SegWriter module and  
eliminating DocumentWriter and SegmentMerger would be a start.  The  
last stage is adapting everything to be backwards compatible with  
char-counts as string headers.
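
For readers unfamiliar with external sorting, the general technique
behind stage 2 can be sketched roughly as follows. This is a generic
Python illustration under my own assumptions (newline-free string items,
a hypothetical `external_sort` function and `run_size` parameter), not a
rendering of the KinoSearch module or its Java port: sort fixed-size
runs in memory, spill each run to a temporary file, then merge the
sorted runs in a single streaming pass.

```python
import heapq
import tempfile


def external_sort(items, run_size):
    """Yield items in sorted order while holding at most run_size in memory.

    Items are assumed to be strings that contain no newline characters.
    """
    runs = []  # one temp file per sorted run
    buf = []   # in-memory buffer for the current run

    def flush():
        # Sort the buffered run and spill it to its own temp file.
        buf.sort()
        f = tempfile.TemporaryFile("w+", encoding="utf-8")
        f.writelines(item + "\n" for item in buf)
        f.seek(0)
        runs.append(f)
        buf.clear()

    try:
        for item in items:
            buf.append(item)
            if len(buf) >= run_size:
                flush()
        if buf:
            flush()
        # Each run is already sorted, so a k-way merge streams the result.
        streams = [(line.rstrip("\n") for line in f) for f in runs]
        yield from heapq.merge(*streams)
    finally:
        for f in runs:
            f.close()
```

Calling `list(external_sort(["pear", "apple", "fig"], 2))` sorts two
runs separately and merges them into one ordered stream.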

I'm not sure that I want to dedicate that much of my time to Lucene,  
at least not right now.  The changes outlined above are pretty  
major.  It's likely that some bugs will get introduced simply because  
of the volume of code change, so that's an argument against making  
any change at all unless there's a real benefit.  There would be --  
the KinoSearch merge model is faster -- but politically speaking,  
selling the whole package to the Lucene community would be a PITA.  
Not only do I have to argue that the tangible benefits justify the  
disruption, I have to make the argument that it's not OK for  
compatibility to begin and end with Java[1][2], plus deal with  
outright hostility and abuse from extreme Java partisans[3].

I'd rather spend my time and energy contributing to Lucy.  Besides, I  
think that ultimately, trying to be compatible with other ports would  
be as much of a drag on Lucy as on Lucene, and I think it's advisable  
for both projects to declare their file formats private.  The Lucene  
file format is just too complex and difficult to serve as a good  
interchange medium.

The only major reason for Lucy to be file-format-compatible with  
Lucene is Luke.  IMO, if we want Luke's benefits, we should be  
hacking Luke.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] http://xrl.us/m2o3 (Link to mail-archives.apache.org)
[2] http://xrl.us/m2o7 (Link to mail-archives.apache.org)
[3] http://xrl.us/m2kp (Link to mail-archives.apache.org)


_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
