On Jun 4, 2006, at 10:46 PM, David Balmain wrote:

> What I
> meant was that Lucy would be striving to maintain "index file format"
> compatibility (which I believe was the plan).

It's funny that we haven't actually settled that. I used to think index compatibility was really important, but I don't so much any more.

Index compatibility is DOA unless Lucene adopts bytecounts as string headers, because it would be insanity for Lucy to deal with the current format. So we're talking compatibility no sooner than Lucene 2.1, and adapting Lucene will be a challenge.

I think the only way to make up the lost speed is to pry in the KinoSearch merge model. I strongly suspect that that will prove to be a marked improvement over not just the patched version, but the current release.

However... It's a lot of work, and I think I'm the only obvious candidate with both the expertise and (maybe) the desire to do it, unless you want to take it on. Two stages out of four are complete. The bytecounts patch was stage 1, and last night I supplied stage 2: a Java port of KinoSearch's external sorting module. Stage 3 is adapting Lucene's indexing apparatus to write indexes by the segment rather than the document -- porting KinoSearch's SegWriter module and eliminating DocumentWriter and SegmentMerger would be a start. The last stage is adapting everything to be backwards compatible with char-counts as string headers.

I'm not sure that I want to dedicate that much of my time to Lucene, at least not right now. The changes outlined above are pretty major. It's likely that some bugs will get introduced simply because of the volume of code change, so that's an argument against making any change at all unless there's a real benefit. There would be -- the KinoSearch merge model is faster -- but politically speaking, selling the whole package to the Lucene community would be a PITA. Not only do I have to argue that the tangible benefits justify the disruption, I have to make the argument that it's not OK for compatibility to begin and end with Java[1][2], plus deal with outright hostility and abuse from extreme Java partisans[3].

I'd rather spend my time and energy contributing to Lucy.

Besides, I think that ultimately, trying to be compatible with other ports would be as much of a drag on Lucy as Lucene, and I think it's advisable for both projects to declare their file formats private. The Lucene file format is just too complex and difficult to serve as a good interchange medium.

The only major reason for Lucy to be file-format-compatible with Lucene is Luke. IMO, if we want Luke's benefits, we should be hacking Luke.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] http://xrl.us/m2o3 (Link to mail-archives.apache.org)
[2] http://xrl.us/m2o7 (Link to mail-archives.apache.org)
[3] http://xrl.us/m2kp (Link to mail-archives.apache.org)

_______________________________________________
Ferret-talk mailing list
http://rubyforge.org/mailman/listinfo/ferret-talk
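
A note on the string-header point in the message above: the difference between a char-count header and a byte-count header is easiest to see in a reader that has to skip over a string. The following is only a minimal sketch, not code from Lucene, KinoSearch, or Lucy. All function names are hypothetical, the strings are assumed to be standard UTF-8 (Java's modified UTF-8 and UTF-16 code-unit counting are ignored, so it is only faithful for BMP text), and the header is assumed to be a variable-length integer with 7 payload bits per byte, low-order group first.

```python
def write_vint(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, low-order group first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def read_vint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a variable-length int at pos; return (value, new_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b < 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7


def encode_char_count(s: str) -> bytes:
    """Header holds the number of characters (assumed pre-patch style)."""
    return write_vint(len(s)) + s.encode("utf-8")


def encode_byte_count(s: str) -> bytes:
    """Header holds the number of encoded bytes (assumed patched style)."""
    data = s.encode("utf-8")
    return write_vint(len(data)) + data


def utf8_width(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def skip_char_count(buf: bytes, pos: int) -> int:
    """Skipping needs one lead-byte inspection per character."""
    count, pos = read_vint(buf, pos)
    for _ in range(count):
        pos += utf8_width(buf[pos])
    return pos


def skip_byte_count(buf: bytes, pos: int) -> int:
    """Skipping is a single addition; the payload is never inspected."""
    count, pos = read_vint(buf, pos)
    return pos + count
```

With a byte-count header, a reader written in C can copy or skip the payload as an opaque block. With a char-count header it has to walk the encoded text just to find where the string ends, which is presumably part of what makes the older format unattractive for a non-Java implementation.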

