Re: Incremental Field Updates

Shai Erera Sun, 25 Apr 2010 04:33:19 -0700

Hi,

WARNING: following email is a bit long, but I think is worth the reading :)

I would like to describe my implementation of incremental field updates
in Juru (the former search library I've worked on in IBM). I will later
discuss how I think it can be implemented in Lucene.

The motivation/requirement came from a doc management system which used
Juru as its search component. The system included document libraries
where users could create files and upload documents. A user could belong
to any number of libraries and complex ACLs model was used (down to the
level of the file). ACLs and Folders were modeled as categories in the
index (boolean-like terms). Files and folders could be moved around and
access to a library/folder/file could be granted/revoked at any given
time. Therefore, such updates usually affected hundreds (and thousands)
of documents. Overall, the index managed millions of documents, tens of
thousands of libraries and hundreds of thousands of ACLs (large
organization :)). To get a rough understanding on the number of such
updates - every several minutes, tens of thousands of documents were
updated due to such changes only (in addition to the regular content
updates).

We were asked to support requests in the following form: "update all docs
that match <criteria> --> do <operation>" where:
* <criteria> was "a single doc", "docs belonging to a category" and "docs
belonging to a set of categories".
* <operation> was "add categories NEW" (NEW might not even exist in the
index yet, or already associated w/ the document), "remove categories OLD"
(irregardless if the docs were already associated w/ OLD or not) and
"remove all OLD and add all NEW".

The solution I've implemented to support these requests turned out to
actually allow you to update every term (!) in the index: suppose that
you have a table, which recorded tuples like <docid, term, +/->. The
record <1, "ibm", '+'> means that doc 1 is associated w/ the term "ibm",
and the record <1, "hp", '-'> means that the same document is not
associated w/ the word "hp". Then, you could very easily ask for all
documents that are assoicated w/ "hp", and the result would not include
doc 1. Note that docid=1 is not the app Doc_ID, but the internal id the
document received.

I've kept two types of postings in the index: regular and updates.
Taking the above examples, "ibm" regular posting looked like <"ibm", 1,
3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm", +2,
-3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly for
"hp".

During search time, when a query with the word "ibm" was submitted, I
create a virtual posting which reads from both the regular and the
updates, and merges them on the fly according to the +/- signs. Since
both postings are sorted in ascending order, the merge is very
efficient, and query time is hardly affected.

Those postings are merged from time to time in a process that is similar
to how Lucene works today, which keeps the update postings relatively
small and manageable.

Now here comes the fun part - how I think it can be implemented in Lucene !

To be honest, this sat on my TODO list for a very long time and only a
couple of months ago I figured out how to implement it in Lucene. The
main difficulty I had was around the difference between the write-once
policy in Juru and Lucene - in Lucene, once a segment is written, it
cannot be changed. BUT, I've only recently realized that this isn't
exactly true, because deleted docs do change existing segments. The
deletes are kept in a separate file to the segment (.del) and have their
own generation. Deletes, as I understood then, and Grant helped me term
them better, can be defined as "Stacked Segments" - they add data to a
segment, which from time to time are integrated into the segment (unlike
Photoshop Layers, but my understanding of Photoshop is limited). And the
Lucene engine knows how to combine the two, giving precedence to the
deletes.

By introducing an "Updates Stacked Segment", we can encode postings w/
the '+'/'-' signs, and when TermDocs/Positions is requested, we can
create a variation which merges the two lists. When segments are merged,
the updates will be merged into the regular postings (just like deletes)
and thus will be gone. In addition, this plays very nicely with readers
that are currently reading the index, as well as we can have generations
for the updates - really like deletes !

I think that Lucene's architecture allows for such a solution very
cleanly and nicely (and I believe flex makes it even easier). We can
(later, after you've digested the idea) discuss whether this should be
built into the current IW, or an extension like UpdateableIW. The API
I've been thinking about should really be like deletes, allowing to
update docs based on Term or Query. I defer the API discussion for later
for now.

As for stored fields, this was a real challenge to support in Juru, but
I think that w/ "Stacked Segments" and Lucene's architecture, this should
be much easier - adding stacked stored fields ...

As you've noticed, the update postings are not DGap encoded, and sign
needs to be preserved. While I haven't implemented it in Juru, I think
that perhaps this can be improved by keeping the '-' and '+' lists
separated. We will need to register somewhere which came before which
because order matters a lot here (and I'm not talking about concurrency
- simple update instructions order). I have some idea how this can be
achieved, but I refrain from describing it now, to not make this email
even longer :).

I've mentioned that this approach can be applied to any term and not
just categories under some circumstances. Basically, as soon as you
update a term, its DF is no longer true, unless you are able to take the
updates into account. We can defer the discussion on that, but clearly
for many fields, incrementally update them should not affect precision,
as they're not used for that type of scoring ... Maybe, by keeping
separate '+' and '-' lists we can compute statistics precisely. And I
haven't given much thought yet to how this and Mike's flex scoring will
be integrated.

BTW, a word on Parallel Indexing - the two are completely orthogonal.
Once PI is introduced, one can index all the updateable fields in a
dedicated slice, for perhaps improving search performance for slices
that are not updateable (not involving code which attempts to read and
merge update and regular lists on the fly). Also, incremental field
updates support all of PI's scenarios, even though some will be done
more efficiently w/ PI. But this too is a matter for a separate
discussion :).

That's it ! I believe I've given you all the details I have about the
approach and high level proposed solution for Lucene. Perhaps some
details slipped my mind, but if you ask the right questions, I'm sure
I'll be able to answer them :). I would like to emphasize that since
this was already implemented (in Juru) - this is more than just a "I
think this approach can work" proposal ...

I would appreciate your comments on this. I would like to start
implementing it soon, and so as a first step, please share your comments
on the overall approach. I'll then write a more detailed description on
how I think to impl it in Lucene (been spending some time on that), and
we can have more detailed (and fun) discussions on the low level
details.

Shai

On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <farh...@gmail.com> wrote:

> Good point. I meant the model at the document level: i.e. what
> milestones does a document go through in its life cycle. Today:
>
> created --> deleted
>
> With incremental updates:
>
> created --> update1 --> update2 --> deleted
>
> I think what I'm trying to say is that this second threaded sequence
> of state changes seems intuitively more fragile under concurrent
> scenarios.  So for example, in a lock-free design, the system would
> also have to anticipate the following sequence of events:
>
> created --> update1 --> deleted --> update2
>
> and consider update2 a null op.  I'm imagining there are other cases
> that I can't think of..
>
> -Babak
>
>
> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
> > write once, plus the option to the app to keep multiple commit points
> > around (by customizing the deletion policy).
> >
> > Actually order of operations / commits very much matters in Lucene today.
> >
> > Deletions are not idempotent: if you add a doc w/ term X, delete by
> > term X, add a new doc with term X... that's very different than if you
> > moved the delete op to the end.  Ie the deletion only applies to the
> > docs added before it.
> >
> > Mike
> >
> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <farh...@gmail.com>
> wrote:
> >> Sure. Because of the write once principle.  But at some cost
> >> (duplicated data). I was just agreeing that it would not be a good
> >> idea to bake in version-ing by keeping the layers around forever in a
> >> merged index; I wasn't keying in on transactions per se.
> >>
> >> Speaking of transactions: I'm not sure if we should worry about this
> >> much yet, but with "updates" the order of the transaction commits
> >> seems important. I think commit order is less important today in
> >> Lucene because its model supports only 2 types of events: document
> >> creation--which only happens once, and document deletion, which is
> >> idempotent.  What do you think? Will commits have to be ordered if we
> >> introduce updates?  Or does the onus of maintaining order fall on the
> >> application?
> >>
> >> -Babak
> >>
> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless
> >> <luc...@mikemccandless.com> wrote:
> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang <farh...@gmail.com>
> wrote:
> >>>>> I think they get merged in by the merger, ideally in the background.
> >>>>
> >>>> That sounds sensible. (In other words, we wont concern ourselves with
> >>>> roll backs--something possible while a "layer" is still around.)
> >>>
> >>> Actually roll backs would still be very possible even if layers are
> merged.
> >>>
> >>> Ie, one could keep multiple commits around, and the older commits
> >>> would still be referring to the old postings + layers, keeping them
> >>> alive.
> >>>
> >>> Lucene would still be transactional with such an approach.
> >>>
> >>> Mike
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >>>
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

Re: Incremental Field Updates

Reply via email to