Hello,
I want to avoid recreating a Lucene Document for the same node over and
over when using aggregates. This is particularly relevant when aggregates
are used for binary data such as a PDF. I think using aggregates for
binary data is a good use case, but indexing and extracting a large PDF is
CPU intensive, and currently this is done multiple times when aggregates
are used. The same holds for non-binary data, only there the performance
penalty is lower.
I would like to add a really small and simple cache to the SearchIndex,
something like:

private final Map<String, WeakReference<Document>> lastCreatedDocs =
        new LRUMap(500);

(WeakReferences are used in case the Lucene Documents are really large.)
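Just as an illustration, the same cache could also be built on a plain
java.util.LinkedHashMap in access order, which avoids the raw LRUMap type
from commons-collections (imports: java.util.Collections,
java.util.LinkedHashMap, java.lang.ref.WeakReference); the 500 entry limit
is of course arbitrary:

// Access-ordered map that evicts the least recently used entry once more
// than 500 Documents are cached; wrapped in a synchronized map in case
// createDocument() can be reached from more than one thread.
private final Map<String, WeakReference<Document>> lastCreatedDocs =
        Collections.synchronizedMap(
                new LinkedHashMap<String, WeakReference<Document>>(16, 0.75f, true) {
                    protected boolean removeEldestEntry(
                            Map.Entry<String, WeakReference<Document>> eldest) {
                        return size() > 500;
                    }
                });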
Now, in

protected Document createDocument(NodeState node, NamespaceMappings
        nsMappings, IndexFormatVersion indexFormatVersion)

we start with:

WeakReference<Document> refDoc =
        lastCreatedDocs.get(node.getNodeId().getUUID().toString());
Document doc;
if (refDoc != null && (doc = refDoc.get()) != null) {
    // We already created a Document for this NodeState within
    // the current transaction. Return it directly.
    return doc;
}
and before the return statement:

lastCreatedDocs.put(node.getNodeId().getUUID().toString(),
        new WeakReference<Document>(doc));
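Putting the two fragments together, createDocument() would then look
roughly like the sketch below; doCreateDocument() is only a placeholder
name for the current, unchanged body of the method (the existing document
creation plus the mergeAggregatedNodeIndexes(node, doc) call), it does not
exist as such:

protected Document createDocument(NodeState node, NamespaceMappings
        nsMappings, IndexFormatVersion indexFormatVersion) {
    String uuid = node.getNodeId().getUUID().toString();

    // Cache hit: this node was already turned into a Document earlier in
    // the same indexing transaction, e.g. while merging an aggregate.
    WeakReference<Document> refDoc = lastCreatedDocs.get(uuid);
    Document doc;
    if (refDoc != null && (doc = refDoc.get()) != null) {
        return doc;
    }

    // Placeholder for the existing document creation code.
    doc = doCreateDocument(node, nsMappings, indexFormatVersion);

    // Remember the Document for the remainder of this indexing transaction.
    lastCreatedDocs.put(uuid, new WeakReference<Document>(doc));
    return doc;
}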
Now, when you have aggregates, mergeAggregatedNodeIndexes(node, doc) will
get a cached Document back for every node it has to merge. This way,
within a *single* indexing transaction, each node is indexed by Lucene
once, and not, possibly, many times.
Obviously, there is one catch: on the next indexing transaction we need to
flush the lastCreatedDocs cache, as a node might have been changed. If we
add a method to the QueryHandler interface, something like

void flush()

and call it whenever we are about to index a new set of nodes, I think we
should be fine. AFAICS, if in MultiIndex

synchronized void update(Iterator remove, Iterator add) throws IOException {

we do a

handler.flush();

we are there.
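Concretely, the pieces would be something along these lines (a sketch
only, assuming MultiIndex keeps its SearchIndex reference in the handler
field used in the call above):

// QueryHandler: new method, invoked before a new set of nodes is indexed.
void flush();

// SearchIndex: drop all Documents cached during the previous transaction.
public void flush() {
    lastCreatedDocs.clear();
}

// MultiIndex.update(): clear the per-transaction cache up front.
synchronized void update(Iterator remove, Iterator add) throws IOException {
    handler.flush();
    // ... existing remove/add handling stays unchanged ...
}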
What do others think? If people like it, I'll create a patch to test.
Regards Ard