On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> On Sep 6, 2006, at 10:30 AM, Yonik Seeley wrote:

> > So it looks like you have intermediate things that aren't lucene
> > segments, but end up producing valid lucene segments at the end of a
> > session?

> That's one way of thinking about it.  There's only one "thing"
> though: a big bucket of serialized index entries.  At the end of a
> session, those are sorted, pulled apart, and used to write the tis,
> tii, frq, and prx files.

> Everything else (e.g. stored fields) gets written incrementally as
> documents get added.  The fact that stored fields don't get shuffled
> around is one of this algorithm's advantages (along with much lower
> memory requirements, etc).
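
For anyone skimming along, here's a toy sketch of that shape in Java.
The names are made up and a plain in-memory sort stands in for the
external sort, so don't read it as KinoSearch's actual API:

    import java.util.*;

    // Toy "bucket" indexer: postings accumulate as (term, doc, pos)
    // entries while stored fields could be written incrementally as
    // docs arrive; one sort at the end groups everything for writing
    // the term dictionary and postings files (tis/tii/frq/prx).
    class BucketIndexer {
        private final List<String[]> bucket = new ArrayList<String[]>();
        private int docId = 0;

        void addDocument(String... terms) {
            int pos = 0;
            for (String term : terms) {
                bucket.add(new String[] {
                    term, String.valueOf(docId), String.valueOf(pos++) });
            }
            docId++;  // stored fields/term vectors would be flushed here, never reshuffled
        }

        void finish() {
            // Sort by term, then doc, then position; a single pass over
            // the sorted entries suffices to write the postings files.
            Collections.sort(bucket, new Comparator<String[]>() {
                public int compare(String[] a, String[] b) {
                    int c = a[0].compareTo(b[0]);
                    if (c == 0) c = Integer.parseInt(a[1]) - Integer.parseInt(b[1]);
                    if (c == 0) c = Integer.parseInt(a[2]) - Integer.parseInt(b[2]);
                    return c;
                }
            });
            for (String[] e : bucket) {
                System.out.println(e[0] + " -> doc " + e[1] + ", pos " + e[2]);
            }
        }

        public static void main(String[] args) {
            BucketIndexer ix = new BucketIndexer();
            ix.addDocument("hello", "world");
            ix.addDocument("world", "peace");
            ix.finish();
        }
    }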

Hmmm, not rewriting stored fields is nice.
I guess that could apply to anything that's strictly document
specific, such as term vectors.

> > For Java lucene, I think the biggest indexing gain could be had by not
> > buffering using single doc segments, but something optimized for
> > in-memory single segment creation.

> In theory, you could apply this technique only to a limited number of
> docs and create segments, say, 10 docs at a time rather than 1 at a
> time.  But then you still have to do something with each 10-doc
> segment, and you don't get the benefits of less disk shuffling and
> lower RAM usage.  Better to just create 1 segment per session.

One should be able to get the bulk of the benefit by buffering 1000 or
10,000 docs at a time though (with increased mem usage, of course).
One problem with extending it to any number of documents is that the
complexity goes up because you can't assume it will all fit in memory.
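
Roughly this kind of control flow, I mean (hypothetical names; the
flush threshold is a doc count here, but it could just as well be RAM
usage):

    import java.util.*;

    // Buffer postings for up to N docs, sort once, and flush them as a
    // single multi-doc segment.
    class BufferingIndexer {
        private static final int MAX_BUFFERED_DOCS = 1000;  // tunable RAM/speed tradeoff
        private final List<String> postings = new ArrayList<String>();
        private int docsInBuffer = 0;
        private int segments = 0;

        void addDocument(String... terms) {
            for (String t : terms) {
                // zero-padded doc id so the lexicographic sort matches doc order
                postings.add(t + '\u0000' + String.format("%06d", docsInBuffer));
            }
            if (++docsInBuffer >= MAX_BUFFERED_DOCS) {
                flushSegment();
            }
        }

        // One sort per segment instead of one segment per doc; the
        // existing merge machinery takes over from here, so the whole
        // index never has to fit in memory.
        void flushSegment() {
            if (docsInBuffer == 0) return;
            Collections.sort(postings);
            System.out.println("segment " + segments++ + ": " + postings.size() + " postings");
            postings.clear();
            docsInBuffer = 0;
        }

        public static void main(String[] args) {
            BufferingIndexer ix = new BufferingIndexer();
            for (int i = 0; i < 2500; i++) ix.addDocument("foo", "bar");
            ix.flushSegment();  // flush the final partial buffer
        }
    }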

That's fine if you're starting from scratch... but if one looks at
lucene, which already has working segment merging to use when things
can't fit in memory, the simplest path toward greater indexing
performance is changing just the first-level merging.

Of course, if someone like you is willing to take on reworking
indexing in general, who am I to complain about the additional effort
involved ;-)

> > The downside is complexity... two
> > sets of "merge" code.

> KS doesn't have SegmentMerger.  :)

Yeah, I was talking about the downside of my incremental plan... only
using a different strategy for the buffered docs, and using segment
merging for everything else.

Still, how do you deal with multiple sessions w/o being able to merge segments?
Do you just keep creating more and more segments?  It seems like if
you had a way to read a segment into an existing "big bucket", then
that's a segment merger.
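
In sketch form, the equivalence I'm getting at (the types here are
hypothetical, just to show the shape):

    import java.util.*;

    // If the "big bucket" can absorb an old segment's postings, then
    // finishing the bucket rewrites old + new docs as one segment --
    // which is exactly what a segment merger does.
    class MergeSketch {
        static class Posting implements Comparable<Posting> {
            final String term; final int doc; final int pos;
            Posting(String term, int doc, int pos) {
                this.term = term; this.doc = doc; this.pos = pos;
            }
            public int compareTo(Posting o) {
                int c = term.compareTo(o.term);
                if (c == 0) c = doc - o.doc;
                if (c == 0) c = pos - o.pos;
                return c;
            }
        }

        // docBase remaps the old segment's doc ids past the new docs
        static void absorbSegment(List<Posting> bucket, List<Posting> oldSegment, int docBase) {
            for (Posting p : oldSegment) {
                bucket.add(new Posting(p.term, p.doc + docBase, p.pos));
            }
            Collections.sort(bucket);  // same end-of-session sort as before
        }

        public static void main(String[] args) {
            List<Posting> bucket = new ArrayList<Posting>();
            bucket.add(new Posting("new", 0, 0));
            List<Posting> old = Arrays.asList(new Posting("old", 0, 0));
            absorbSegment(bucket, old, 1);
            for (Posting p : bucket) System.out.println(p.term + " doc=" + p.doc);
        }
    }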

> > It would be interesting to see an IndexWriter2 for full Gordian Knot
> > cutting like you do :-)

> I've already contributed a Java port of KinoSearch's external sorter
> (along with its tests), which is the crucial piece.  The rest isn't
> easy, but stay tuned.  ;)

I definitely will.
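
For reference, the generic shape that kind of external sorter is built
on: spill sorted runs to temp files when RAM fills up, then k-way merge
the runs.  This is just the textbook technique, not the ported code
itself:

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    class ExternalSortSketch {
        static void sort(Path in, Path out, int maxLinesInRam) throws IOException {
            List<Path> runs = new ArrayList<Path>();
            try (BufferedReader r = Files.newBufferedReader(in)) {
                List<String> buf = new ArrayList<String>();
                String line;
                while ((line = r.readLine()) != null) {
                    buf.add(line);
                    if (buf.size() >= maxLinesInRam) runs.add(writeRun(buf));
                }
                if (!buf.isEmpty()) runs.add(writeRun(buf));
            }
            mergeRuns(runs, out);
        }

        // Sort the in-RAM buffer and spill it to a temp file (a "run").
        private static Path writeRun(List<String> buf) throws IOException {
            Collections.sort(buf);
            Path run = Files.createTempFile("sort-run", ".txt");
            Files.write(run, buf);
            buf.clear();
            return run;
        }

        // Merge all runs with a priority queue keyed on each run's head line.
        private static void mergeRuns(List<Path> runs, Path out) throws IOException {
            PriorityQueue<RunCursor> pq = new PriorityQueue<RunCursor>();
            for (Path run : runs) {
                RunCursor c = new RunCursor(Files.newBufferedReader(run));
                if (c.advance()) pq.add(c);
            }
            try (BufferedWriter w = Files.newBufferedWriter(out)) {
                while (!pq.isEmpty()) {
                    RunCursor c = pq.poll();
                    w.write(c.line);
                    w.newLine();
                    if (c.advance()) pq.add(c); else c.reader.close();
                }
            }
        }

        private static final class RunCursor implements Comparable<RunCursor> {
            final BufferedReader reader;
            String line;
            RunCursor(BufferedReader reader) { this.reader = reader; }
            boolean advance() throws IOException {
                line = reader.readLine();
                return line != null;
            }
            public int compareTo(RunCursor o) { return line.compareTo(o.line); }
        }
    }

Usage would be something like ExternalSortSketch.sort(inPath, outPath,
100000); the postings case would compare decoded entries rather than
raw text lines, but the run/merge structure is the same.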

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
