Scanning docs at index time

Nigel Mon, 22 Feb 2010 08:30:10 -0800

I'd like to scan documents as they're being indexed, to find out immediately
if any of them match certain queries.  The goal is to find out of there are
any new hits for these queries as soon as possible, without re-searching the
index over and over (which would be inefficient, and higher latency).  The
documents still need to be indexed (not just scanned) so they can be
searched later with different queries not known at index time.


The indexing throughput is in the tens of millions per day, and there are
maybe a thousand queries or so to be matched.  So this has to work pretty
fast.  (-:  Fortunately the number and size of fields are both fairly small.

This scanning could of course be completely decoupled from the indexing
process.  But my thinking was that since we already have the documents in
hand, and we'll be analyzing various fields in the course of indexing, we
could ideally reuse those token streams somehow for this on-the-fly scanning
process.

I took a look at the org.apache.lucene.index.memory.MemoryIndex class in
contrib.  It looks like that would work, but I'm not sure if it's the most
appropriate solution (for one thing, it would have to re-analyze all the
fields).  Has anyone here done something similar and/or know of other
classes that would be suitable?

Thanks,
Chris

Scanning docs at index time

Reply via email to