Antony Bowesman wrote:

Hi Mike,

Unfortunately you will have to delete the old doc, then reindex a new doc, in order to change any payloads in the document's Tokens.
This issue:
   https://issues.apache.org/jira/browse/LUCENE-1231
which is still in progress, could make updating stored (but not indexed) fields a much cheaper operation, but that is not certain, and it's not clear when that issue will be done.

Michael Busch's ApacheCon (2006/7??) presentation summarized with the bullet

"Per-document Payloads – updateable"

Ahh -- this is just another name for "column-stride fields" (which is the above issue I linked to).

Normal payloads are per term occurrence, ie, every position in the document can have its own payload.

Whereas "per-document payloads" means there is a single payload per field in the document, which is logically no different from a stored field, except that the underlying storage would be more efficient (column-stride, where that field's value for all docs is stored together, versus the normal row-stride used by current stored fields, where all field values for a single document are stored together).
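
To make the row-stride versus column-stride distinction concrete, here is a minimal toy sketch in Python. It is only an illustration: the names (`row_store`, `column_store`, the `rank` field) are made up for this example, and nothing here reflects Lucene's actual file formats or the design in LUCENE-1231. The in-place update at the end is an assumption about why a column-stride layout might make updates cheaper.

```python
# Toy model only: these structures are illustrative, not Lucene's file formats.

docs = [
    {"title": "a", "rank": 3},
    {"title": "b", "rank": 7},
    {"title": "c", "rank": 5},
]

# Row-stride: all field values for a single document are stored together.
row_store = [dict(d) for d in docs]

# Column-stride: one field's values for all documents are stored together,
# addressed by document number.
column_store = {
    field: [d[field] for d in docs]
    for field in ("title", "rank")
}


def read_rank_row_stride(doc_id):
    # The whole document record is fetched just to reach one field.
    return row_store[doc_id]["rank"]


def read_rank_column_stride(doc_id):
    # Only the "rank" column is touched.
    return column_store["rank"][doc_id]


def update_rank_column_stride(doc_id, value):
    # Assumption: with a fixed-width column, an update overwrites one slot.
    column_store["rank"][doc_id] = value


update_rank_column_stride(1, 9)
```

In this sketch the column-stride update touches a single slot in one list, while the row-stride copy of doc 1 is left as it was, which is roughly why the two layouts differ in update cost.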

Is making a document 'updatable' (in _some_ way) something still seen as a long term goal for Lucene?

I would say it is a goal in that there is a lot of interest and discussion around how to do this. I think LUCENE-1231 is the most concrete recent effort and the most likely to be the first path that makes updating documents possible.

As far as implementation is concerned, if a stored (not indexed) field may be updatable with 1231, is there some difficulty with making payloads, which from my understanding are attributed to a posting of an indexed field, updatable? I guess they ultimately equate to the same thing - i.e. using a stored field to hold the document's "payload", but it would be an extra field to load.

Updating the postings lists (freq/prx and payloads) is unfortunately quite a bit trickier than updating a column-stride or row-stride stored field.

I think the approach we need to eventually take is to allow "patches" onto a segment's postings lists.

For example, segment _X would have the original large _X.frq/prx but then could have, say, _X_1.frq/prx, which is a much smaller file containing postings for those docs that have been updated since the segment was originally created. If more docs are updated, that would produce _X_2.frq/prx, etc.

IndexReaders would then need to hold open all of these postings and dynamically "apply" the patch such that a doc's postings are iterated from the newest frq/prx file that it exists in. Optimize() and partial optimize() would then coalesce these files back into 1 (or maybe a few) frq/prx files.
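
The patching idea above can be sketched as a toy model in Python. This is only a thought experiment, not an existing Lucene API: `apply_patch`, `postings`, and `coalesce` are hypothetical names, and each "generation" is a plain dict standing in for a frq/prx file (`generations[0]` for _X, later entries for _X_1, _X_2, and so on). Each patch records which docs it covers, on the assumption that an updated doc may have dropped a term it used to contain.

```python
# Hypothetical sketch: each generation maps term -> {doc_id: [positions]}.
# "docs" is the set of docs the generation covers (None = base, covers all).


def apply_patch(generations, doc_id, new_postings):
    """Append a small generation holding the updated postings for one doc."""
    patch = {"docs": {doc_id}, "postings": {}}
    for term, positions in new_postings.items():
        patch["postings"][term] = {doc_id: list(positions)}
    generations.append(patch)


def newest_generation(generations, doc_id):
    """Index of the newest generation that contains doc_id (0 = base)."""
    for i in range(len(generations) - 1, 0, -1):
        if doc_id in generations[i]["docs"]:
            return i
    return 0


def postings(generations, term):
    """Iterate a term's postings, taking each doc from its newest generation."""
    result = {}
    for i, gen in enumerate(generations):
        for doc_id, positions in gen["postings"].get(term, {}).items():
            if newest_generation(generations, doc_id) == i:
                result[doc_id] = positions
    return result


def coalesce(generations):
    """Fold all generations back into one, as an optimize step might."""
    terms = set()
    for gen in generations:
        terms.update(gen["postings"])
    merged = {}
    for term in sorted(terms):
        live = postings(generations, term)
        if live:
            merged[term] = live
    return [{"docs": None, "postings": merged}]


base = {"docs": None,
        "postings": {"lucene": {0: [1], 1: [4]}, "payload": {1: [2]}}}
generations = [base]

# Doc 1 is updated: it now has "lucene" at position 0 and no "payload" at all.
apply_patch(generations, 1, {"lucene": [0]})
merged = coalesce(generations)
```

Here a reader holding both generations sees doc 1's postings only from the patch, so the stale "payload" entry in the base is skipped, and `coalesce` produces a single generation with the same visible postings.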

At least that's my current thinking on how we would approach updating postings... but realistically these are just thoughts and are quite a ways off from becoming a reality!

Mike