Re: Fast index traversal and update for stored field?

Erick Erickson Wed, 14 Mar 2007 15:51:11 -0800

If you search the mail archive for "update in place" (no quotes),
you'll find extensive discussions of this idea. You're raising an
interesting variant, though, because you're talking about a
non-indexed field, so I'm not sure those discussions are relevant.


I don't know of anyone who has done what you're asking though...

But if it's just stored data, you could go out to a database and
pick it up at search time, although there are sound reasons for
not requiring a database connection.

What about having a separate index for just this one field? Make
it an indexed value, along with some ID of your original document
(probably not the Lucene document number). Something like

index fields:
  ID     (unique ID for each document)
  field  (the corresponding value)

Searching this should be very fast, and if the usual Hits-based
search isn't fast enough, perhaps something with
TermEnum/TermDocs would be faster.

Or you could just index the unique ID and store (but not index)
the field. Hits or variants should work for that too.

So the general algorithm would be:

search main index
for each hit:
  search second index and fetch that field
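
For what it's worth, here is a minimal Java sketch of that loop. It is
only an illustration: TwoStepLookup, Result and join are made-up names,
the main-index hits are assumed to arrive as a list of unique IDs, and
the second index is stood in for by a plain function from ID to value
rather than any real Lucene call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: neither index is a real Lucene object here.
final class TwoStepLookup {

    // A main-index hit paired with the value fetched from the second index.
    static final class Result {
        final String id;
        final String value; // null if the second index has no entry for this id

        Result(String id, String value) {
            this.id = id;
            this.value = value;
        }
    }

    // mainHitIds:  unique IDs of the hits from the main index, in rank order.
    // secondIndex: stand-in for "search second index and fetch that field".
    static List<Result> join(List<String> mainHitIds,
                             Function<String, String> secondIndex) {
        List<Result> results = new ArrayList<>();
        for (String id : mainHitIds) {
            results.add(new Result(id, secondIndex.apply(id)));
        }
        return results;
    }
}
```

The cost is one extra lookup per displayed hit, so it matters mostly
how cheap that second lookup can be made.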

I have no idea whether this has any traction for your problem
space, but I thought I'd mention it. This assumes that building
the mutable index would be acceptably fast...

Although conceptually, this is really just a Map of ID/value pairs.
I have no idea how much data you're talking about, but if it's not
a huge data set, might it be possible just to store it in
a simple map and look it up that way?
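
If the map route is viable, something along these lines might do. It
is a sketch under assumptions, not tested against your setup:
SideFieldStore and its methods are invented names, IDs and values are
assumed to be strings, and the daily bulk update is assumed to produce
a complete fresh set of ID/value pairs that fits in memory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: keeps the mutable field outside the index, keyed by unique ID.
final class SideFieldStore {

    // Replaced wholesale on each bulk update and never mutated afterwards;
    // volatile so that searching threads see the new map once it is swapped in.
    private volatile Map<String, String> values = new HashMap<>();

    // Bulk update: copy the fresh data, then publish it with a single reference write.
    void replaceAll(Map<String, String> fresh) {
        values = new HashMap<>(fresh);
    }

    // Search-time lookup; returns null when the ID has no value.
    String get(String id) {
        return values.get(id);
    }

    int size() {
        return values.size();
    }
}
```

Searches keep reading the old map until the new one is fully built, so
the bulk update never blocks them and the index itself is not touched.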

And if I'm all wet, I'm sure others will chime in...

Best
Erick

On 3/14/07, Thomas K. Burkholder <[EMAIL PROTECTED]> wrote:


Hi there,

I'm using Lucene to index and store entries from a database table for
ultimate retrieval as search results.  This works fine.  But I find
myself in the position of wanting to occasionally (daily-ish) bulk-
update a single, stored, non-indexed field in every document in the
index, without changing any indexed value at all.

The obvious, documented way to do this would be to remove and then
re-add each updated document successively.  However, I know from
experience that rebuilding our index from scratch in this fashion
would take several hours at least, which is too long to delay pending
incremental index jobs.  It seems to me that at some level it should
be possible to iterate over all the document storage on disk and
modify only the field I'm interested in (no index modification is
required, remember, as this is a field that is stored but not
indexed).  It's plain from the documentation on file formats that it
would be possible to do this at a low level.  However, before I go
re-inventing that wheel, I'm wondering if anyone knows of any
existing code out there that would aid in solving this problem.

Thanks in advance,

//Thomas
Thomas K. Burkholder
Code Janitor

