Antony Bowesman wrote:
Hi Mike,
Unfortunately you will have to delete the old doc, then reindex a
new doc, in order to change any payloads in the document's Tokens.
This issue:
https://issues.apache.org/jira/browse/LUCENE-1231
which is still in progress, could make updating stored (but not
indexed) fields a much lower cost operation, but that's not for
sure and it's not clear when that issue will be done.
Michael Busch's ApacheCon (2006/7??) presentation summarized this
with the bullet
"Per-document Payloads – updateable"
Ahh -- this is just another name for "column-stride fields" (which is
the above issue I linked to).
Normal payloads are per term occurrence, i.e., every position in the
document can have its own payload.
Whereas "per-document payloads" means there is a single payload per
field in the document, which logically is no different from a stored
field, except the underlying storage would be more efficient
(column-stride, where that field's value for all docs is stored
together, vs the normal row-stride used by current stored fields,
where all field values for a single document are stored together).
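
To make the row-stride vs column-stride distinction concrete, here is
a minimal in-memory sketch. It is only an illustration of the two
layouts, not Lucene's file format or the LUCENE-1231 design; the
names StrideSketch, RowStore and ColumnStore are made up for this
example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch contrasting the two layouts; not Lucene's on-disk format. */
final class StrideSketch {

    /** Row-stride: all field values of one document are stored together. */
    static final class RowStore {
        private final List<Map<String, String>> rows = new ArrayList<>();

        /** Appends a document and returns its docId. */
        int addDoc(Map<String, String> fields) {
            rows.add(new HashMap<>(fields));
            return rows.size() - 1;
        }

        String get(int docId, String field) {
            return rows.get(docId).get(field);
        }
    }

    /** Column-stride: one field's values for all documents are stored together. */
    static final class ColumnStore {
        private final Map<String, List<String>> columns = new HashMap<>();

        /** Sets (or overwrites) one document's value without touching other fields. */
        void set(String field, int docId, String value) {
            List<String> col = columns.computeIfAbsent(field, f -> new ArrayList<>());
            while (col.size() <= docId) {
                col.add(null); // pad for docs that have no value for this field
            }
            col.set(docId, value);
        }

        String get(String field, int docId) {
            List<String> col = columns.get(field);
            return (col == null || docId >= col.size()) ? null : col.get(docId);
        }

        /** Reading one field across all docs touches a single list. */
        List<String> column(String field) {
            return columns.getOrDefault(field, List.of());
        }
    }
}
```

In this toy model, overwriting one doc's value in ColumnStore is a
single slot update in that field's list, which is roughly why a
column-stride layout looks like a more natural fit for updatable
per-document values.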
Is making a document 'updatable' (in _some_ way) something still
seen as a long term goal for Lucene?
I would say it is a goal in that there is a lot of interest and
discussion around how to do this. I think LUCENE-1231 is the most
concrete recent effort and the most likely to be the first path that
makes updating documents possible.
As far as implementation is concerned, if a stored (not indexed)
field may be updatable with 1231, is there some difficulty with
making payloads, which from my understanding are attributed to a
posting of an indexed field, updatable? I guess they ultimately
equate to the same thing, i.e. using a stored field to hold the
document's "payload", but it would be an extra field to load.
Updating the postings lists (freq/prx & payloads) is unfortunately
quite a bit trickier than updating column-stride or row-stride stored
fields.
I think the approach we need to eventually take is to allow "patches"
onto a segment's postings lists.
For example, segment _X would have the original large _X.frq/prx but
then could have say _X_1.frq/prx which is a much smaller file
containing postings for those docs that have been updated since the
segment was originally created. If more docs are updated that would
produce _X_2.frq/prx, etc.
IndexReaders would then need to hold open all of these postings and
dynamically "apply" the patch such that a doc's postings are iterated
from the newest frq/prx file that it exists in. Optimize() and
partial optimize() would then coalesce these files back into 1 (or
maybe a few) frq/prx files.
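
The patch idea above can be sketched roughly as follows. This is a
toy in-memory model for the postings of a single term in a single
segment, under assumptions of my own: the class name PatchedPostings
is hypothetical, each "generation" stands in for one of the
_X / _X_1 / _X_2 frq/prx files, and an empty position array marks a
doc that was updated and no longer contains the term. None of this
exists in Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: postings for ONE term in ONE segment, plus patch generations. */
final class PatchedPostings {
    // generations.get(0) plays the role of _X.frq/prx; later entries play _X_1, _X_2, ...
    // Each maps docId -> positions. An empty array means the doc was updated
    // and no longer contains the term (an assumption of this sketch).
    private final List<TreeMap<Integer, int[]>> generations = new ArrayList<>();

    PatchedPostings(Map<Integer, int[]> original) {
        generations.add(new TreeMap<>(original));
    }

    /** Adds a new, smaller generation covering only the updated docs. */
    void addPatch(Map<Integer, int[]> updatedDocs) {
        generations.add(new TreeMap<>(updatedDocs));
    }

    /** Resolves a doc's positions from the newest generation that contains it. */
    int[] positions(int docId) {
        for (int g = generations.size() - 1; g >= 0; g--) {
            int[] p = generations.get(g).get(docId);
            if (p != null) {
                return p;
            }
        }
        return new int[0]; // doc never had this term
    }

    /** Merges all generations back into one, as an optimize-style pass might. */
    void coalesce() {
        TreeMap<Integer, int[]> merged = new TreeMap<>();
        for (TreeMap<Integer, int[]> gen : generations) {
            merged.putAll(gen); // later generations overwrite earlier ones
        }
        merged.values().removeIf(p -> p.length == 0); // drop "no longer has term" markers
        generations.clear();
        generations.add(merged);
    }

    int generationCount() {
        return generations.size();
    }
}
```

A reader in this model pays for every open generation on each
lookup, which is why coalescing them back into one (or a few) would
matter in practice.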
At least that's my current thinking on how we would approach updating
postings... but realistically these are just thoughts and are quite a
ways off from becoming a reality!
Mike