On Sep 6, 2006, at 10:30 AM, Yonik Seeley wrote:
> So it looks like you have intermediate things that aren't lucene segments, but end up producing valid lucene segments at the end of a session?
That's one way of thinking about it. There's only one "thing" though: a big bucket of serialized index entries. At the end of a session, those are sorted, pulled apart, and used to write the tis, tii, frq, and prx files.
Everything else (e.g., stored fields) gets written incrementally as documents get added. The fact that stored fields don't get shuffled around is one of this algorithm's advantages (along with much lower memory requirements, etc.).
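Roughly, the flow looks like the sketch below. All the names are hypothetical, tab-separated strings stand in for the real serialized entries, and a plain in-memory list stands in for the sort pool (which in practice spills to disk):

    import java.util.*;

    // Hypothetical sketch of the session flow described above -- not
    // KinoSearch's or Lucene's actual API.
    class PostingsBucket {
        private final List<String> bucket = new ArrayList<>();  // serialized index entries
        private int docId = 0;

        void addDocument(Map<String, String> storedFields, List<String> terms) {
            writeStoredFields(docId, storedFields);              // stored fields go out right away
            int pos = 0;
            for (String term : terms) {
                bucket.add(term + "\t" + docId + "\t" + pos++);  // buffer one serialized entry
            }
            docId++;
        }

        void finishSession() {
            Collections.sort(bucket);                            // one big sort at session's end
            for (String entry : bucket) {
                // Pulled apart here: terms -> tis/tii, doc data -> frq, positions -> prx.
                System.out.println(entry);
            }
        }

        private void writeStoredFields(int doc, Map<String, String> fields) {
            // In the real index this appends to the stored-fields files incrementally.
        }
    }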
> For Java lucene, I think the biggest indexing gain could be had by not buffering using single doc segments, but something optimized for in-memory single segment creation.
In theory, you could apply this technique to a limited number of docs at a time and create segments, say, 10 docs at a time rather than 1. But then you still have to do something with each 10-doc segment, and you lose the benefits of less disk shuffling and lower RAM usage. Better to just create 1 segment per session.
> The downside is complexity... two sets of "merge" code.
KS doesn't have SegmentMerger. :)
> It would be interesting to see an IndexWriter2 for full Gordian Knot cutting like you do :-)
I've already contributed a Java port of KinoSearch's external sorter (along with its tests), which is the crucial piece. The rest isn't easy, but stay tuned. ;)
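For anyone unfamiliar with the technique, the general shape of an external sort (sorted runs spilled to temp files, then a k-way merge) is sketched below. This is a generic illustration over lines of text, not the contributed sorter's actual API:

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    // Generic external merge sort over lines of text -- an illustration of
    // the technique only, not the sorter contributed to Lucene.
    public class ExternalLineSorter {

        // Sort `input` into `output`, holding at most `maxInMemory` lines at once.
        public static void sort(Path input, Path output, int maxInMemory) throws IOException {
            List<Path> runs = new ArrayList<>();

            // Phase 1: read chunks, sort each in memory, spill each as a sorted "run".
            try (BufferedReader in = Files.newBufferedReader(input)) {
                List<String> buffer = new ArrayList<>(maxInMemory);
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() >= maxInMemory) {
                        runs.add(spill(buffer));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) runs.add(spill(buffer));
            }

            // Phase 2: k-way merge of the sorted runs via a priority queue.
            PriorityQueue<RunReader> queue = new PriorityQueue<>();
            for (Path run : runs) {
                RunReader r = new RunReader(run);
                if (r.advance()) queue.add(r);
            }
            try (BufferedWriter out = Files.newBufferedWriter(output)) {
                while (!queue.isEmpty()) {
                    RunReader top = queue.poll();
                    out.write(top.current);
                    out.newLine();
                    if (top.advance()) queue.add(top); else top.close();
                }
            }
        }

        private static Path spill(List<String> buffer) throws IOException {
            Collections.sort(buffer);
            Path run = Files.createTempFile("sort-run", ".tmp");
            Files.write(run, buffer);
            return run;
        }

        // One open run file plus its current (smallest unconsumed) line.
        private static class RunReader implements Comparable<RunReader> {
            final BufferedReader reader;
            String current;

            RunReader(Path p) throws IOException { reader = Files.newBufferedReader(p); }

            boolean advance() throws IOException { return (current = reader.readLine()) != null; }

            void close() throws IOException { reader.close(); }

            public int compareTo(RunReader other) { return current.compareTo(other.current); }
        }
    }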
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/