Re: Modifying a document by updating a payloads?

Michael McCandless Thu, 31 Jul 2008 02:34:36 -0700


Antony Bowesman wrote:

Hi Mike,
Unfortunately you will have to delete the old doc, then reindex a new doc, in order to change any payloads in the document's Tokens.
This issue:
   https://issues.apache.org/jira/browse/LUCENE-1231
which is still in progress, could make updating stored (but not indexed) fields a much lower cost operation, but that's not for sure and it's not clear when that issue will be done.
Michael Busch's Apache Con (2006/7??) presentation summarized with the bullet
"Per-document Payloads – updateable"

Ahh -- this is just another name for "column-stride fields" (which is the above issue I linked to).

Normal payloads are per term occurrence, i.e., every position in the document can have its own payload.

Whereas "per-document payloads" means there is a single payload per field in the document, which logically is no different than a stored field, except the underlying storage would be more efficient (column-stride, where that field's value for all docs is stored together, vs the normal row-stride used by current stored fields, where all field values for a single document are stored together).
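
To make the row-stride vs column-stride distinction concrete, here is a minimal sketch in Python. It is a toy in-memory model only: the field names (`title`, `rank`) and the helper functions are hypothetical and have nothing to do with Lucene's actual file formats or API.

```python
# Toy model only: illustrative structures, not Lucene's on-disk formats.

# Row-stride: all field values for a single document are stored together.
row_stride = [
    {"title": "doc zero", "rank": 3},
    {"title": "doc one", "rank": 7},
    {"title": "doc two", "rank": 5},
]

# Column-stride: one field's values for all documents are stored together.
column_stride = {
    "title": ["doc zero", "doc one", "doc two"],
    "rank": [3, 7, 5],
}


def ranks_row_stride(docs):
    # Every document record must be visited to pull out a single field.
    return [doc["rank"] for doc in docs]


def ranks_column_stride(columns):
    # The whole field is already available as one contiguous list.
    return list(columns["rank"])


def update_rank(columns, doc_id, new_rank):
    # Changing one document's value touches one slot in one column.
    columns["rank"][doc_id] = new_rank


# Update document 1 in the column-stride layout only.
update_rank(column_stride, 1, 9)
```

In this model, reading or updating one field across many documents stays within a single list in the column-stride layout, which is the efficiency argument made above.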

Is making a document 'updatable' (in _some_ way) something still seen as a long term goal for Lucene?

I would say it is a goal in that there is a lot of interest and discussion around how to do this. I think LUCENE-1231 is the most concrete recent effort & most likely to be the first path that makes updating documents possible.

As far as implementation is concerned, if a stored (not indexed) field may be updatable with 1231, is there some difficulty with making payloads, which from my understanding are attributed to a posting of an indexed field, updatable? I guess they ultimately equate to the same thing - i.e. using a stored field to hold the document's "payload", but it would be an extra field to load.

Updating the postings lists (freq/prx & payloads) is unfortunately quite a bit trickier than updating column-stride or row-stride stored fields.

I think the approach we need to eventually take is to allow "patches" onto a segment's posting lists.

For example, segment _X would have the original large _X.frq/prx but then could have, say, _X_1.frq/prx, which is a much smaller file containing postings for those docs that have been updated since the segment was originally created. If more docs are updated, that would produce _X_2.frq/prx, etc.

IndexReaders would then need to hold open all of these postings and dynamically "apply" the patch such that a doc's postings are iterated from the newest frq/prx file that it exists in. Optimize() and partial optimize() would then coalesce these files back into 1 (or maybe a few) frq/prx files.
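
The patch idea can be sketched roughly as follows, in Python. This is a toy in-memory model under stated assumptions, not an implementation of anything in Lucene: the `PatchedSegment` class and its method names are hypothetical, and each "generation" is just a dict standing in for one frq/prx file.

```python
# Toy model: each generation maps doc id -> {term: [(position, payload), ...]}.
class PatchedSegment:
    def __init__(self, base_postings):
        # Generation 0 stands in for the original _X.frq/prx.
        self.generations = [dict(base_postings)]

    def apply_patch(self, updated_docs):
        # Each patch stands in for a new, smaller _X_N.frq/prx file.
        self.generations.append(dict(updated_docs))

    def postings_for(self, doc_id):
        # Read from the newest generation that contains the doc.
        for generation in reversed(self.generations):
            if doc_id in generation:
                return generation[doc_id]
        raise KeyError(doc_id)

    def coalesce(self):
        # Fold all generations into one, oldest first so newer entries win.
        merged = {}
        for generation in self.generations:
            merged.update(generation)
        self.generations = [merged]


segment = PatchedSegment({
    0: {"lucene": [(0, b"a")]},
    1: {"lucene": [(2, b"b")]},
})
segment.apply_patch({1: {"lucene": [(2, b"c")]}})  # doc 1 updated
segment.apply_patch({0: {"lucene": [(0, b"d")]}})  # doc 0 updated later
```

Here `coalesce()` plays the role that optimize() would play in the description above: afterwards there is a single generation again, and lookups return the same postings as before the merge.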

At least that's my current thinking on how we would approach updating postings... but realistically these are just thoughts and are quite a ways off from becoming a reality!


Mike
