On Mon, Mar 16, 2009 at 07:02:05PM -0400, Michael McCandless wrote:

> OK; I guess the salient question is how important is it, really, for
> deletes to apply to docs added in the current session.  Do users ever
> miss that?

It's been noted.  I don't think anyone thinks the current KS behavior is
better.

> Right, OK.  Though Lucene takes a similar approach to KS, it's just
> that we allow the intermediate segments (runs?) to be committed as
> segments, and then normal merging merges them (ie, the doc store files
> are shared across segments for a single session).

I think we can make that work for Lucy, too.  The KS external sorter now
writes temporary files using the standard lexicon and postings formats.  It's
flushing these to "runs", but there's not much of a difference between these
runs and actual segments.

Plus, the Snapshot class I'm about to propose for Lucy is specifically
designed to accommodate extra-segment files like the docs stores.

Cross-pollination FTW. :)

> > Yeah.  I think if we cap the deletion percentage at 10% and merge
> > any segment that exceeds that rate, plus apply deletions at the
> > Scorer_Collect level, tombstones won't absolutely murder us.  But
> > there's some work to do there.
> 
> Run some C perf tests first!

For now I'll just put it off and use bit vector deletions.  I only need one
writer and one consolidator at present.  Total throughput isn't a problem --
worst-case performance is what's driving this feature.

> Interesting, Lucy will do GC of the search objects?  Is this used for
> all objects, eg Document, Field, Term, Query, etc.?

It's basically a straight-up refcounting model: each object keeps its own
refcount, call Destroy() as soon as the refcount hits zero.  All full-fledged
objects are refcounted.

There are some quirks, though, with how it manages host objects.  The default
behavior is to create a host language object at the same time as the
Boilerplater object, and have the host object manage the refcount.  

However, there are a few classes -- CharBuf, VTable, and Posting among them --
that subclass "FastObj", which has its own refcount, and creates a host
wrapper each time it crosses into host space.

The second member in the Obj struct is a union:

    typedef union boil_ref_t {
        chy_u32_t count;
        void *host_obj;
    } boil_ref_t;

FastObj keeps track of its refcounts using self->ref.count; Obj delegates to
self->ref.host_obj.  It's up to the binding to provide implementations for
Obj_Inc_RefCount and Obj_Dec_RefCount.

I haven't worked out the details on how the refcounted model is going to
function within GC systems like Java and Ruby.  However, I think writing a
full-featured tracing garbage collector that's supposed to work within a
reference counted system like Perl's would be harder.

I've tried searching the web for resources on how to make refcounting and GC
coexist happily, but I haven't found anything so far.  If anybody's got strong
google-fu today or actually knows and can recommend some literature, I'm all
ears.

> I've been wondering how we can take advantage of concurrency within a single
> search in Lucene.

Well, doesn't it at least make logical sense to dedicate one thread per
segment during search?  Then you just have to adopt a different definition for
"optimize": instead of an "optimized" index always meaning "one segment", it
means "one segment per thread".

Marvin Humphrey

Reply via email to