I'd like to scan documents as they're being indexed, to find out immediately if any of them match certain queries. The goal is to find out of there are any new hits for these queries as soon as possible, without re-searching the index over and over (which would be inefficient, and higher latency). The documents still need to be indexed (not just scanned) so they can be searched later with different queries not known at index time.
The indexing throughput is in the tens of millions per day, and there are maybe a thousand queries or so to be matched. So this has to work pretty fast. (-: Fortunately the number and size of fields are both fairly small. This scanning could of course be completely decoupled from the indexing process. But my thinking was that since we already have the documents in hand, and we'll be analyzing various fields in the course of indexing, we could ideally reuse those token streams somehow for this on-the-fly scanning process. I took a look at the org.apache.lucene.index.memory.MemoryIndex class in contrib. It looks like that would work, but I'm not sure if it's the most appropriate solution (for one thing, it would have to re-analyze all the fields). Has anyone here done something similar and/or know of other classes that would be suitable? Thanks, Chris
