Hello,
I want to avoid recreating a Lucene Document for the same node over and
over when using aggregates. This is particularly relevant when aggregates
are used for binary data such as a PDF. I think using aggregates for
binary data is a good use case, but indexing and extracting a large PDF is
CPU intensive, and currently this is done multiple times when aggregates
are used. The same holds for non-binary data, only there the performance
penalty is lower.
I would like to add a really small and simple cache to the SearchIndex,
something like:

private final Map<String, WeakReference<Document>> lastCreatedDocs =
        new LRUMap(500);

(WeakReferences are used in case the Lucene Documents are really large.)
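Just as an illustration, the same cache could also be built on a plain
java.util.LinkedHashMap in access order, which avoids the raw LRUMap type
from commons-collections (imports: java.util.Collections,
java.util.LinkedHashMap, java.lang.ref.WeakReference); the 500 entry limit
is of course arbitrary:

// Access-ordered map that evicts the least recently used entry once more
// than 500 Documents are cached; wrapped in a synchronized map in case
// createDocument() can be reached from more than one thread.
private final Map<String, WeakReference<Document>> lastCreatedDocs =
        Collections.synchronizedMap(
                new LinkedHashMap<String, WeakReference<Document>>(16, 0.75f, true) {
                    protected boolean removeEldestEntry(
                            Map.Entry<String, WeakReference<Document>> eldest) {
                        return size() > 500;
                    }
                });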
Now, in

protected Document createDocument(NodeState node, NamespaceMappings
        nsMappings, IndexFormatVersion indexFormatVersion)

we start with:

WeakReference<Document> refDoc =
        lastCreatedDocs.get(node.getNodeId().getUUID().toString());
Document doc;
if (refDoc != null && (doc = refDoc.get()) != null) {
    // We already created a Document for this NodeState within
    // the current transaction. Return it directly.
    return doc;
}
and before the return statement:

lastCreatedDocs.put(node.getNodeId().getUUID().toString(),
        new WeakReference<Document>(doc));
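Putting the two fragments together, createDocument() would then look
roughly like the sketch below; doCreateDocument() is only a placeholder
name for the current, unchanged body of the method (the existing document
creation plus the mergeAggregatedNodeIndexes(node, doc) call), it does not
exist as such:

protected Document createDocument(NodeState node, NamespaceMappings
        nsMappings, IndexFormatVersion indexFormatVersion) {
    String uuid = node.getNodeId().getUUID().toString();

    // Cache hit: this node was already turned into a Document earlier in
    // the same indexing transaction, e.g. while merging an aggregate.
    WeakReference<Document> refDoc = lastCreatedDocs.get(uuid);
    Document doc;
    if (refDoc != null && (doc = refDoc.get()) != null) {
        return doc;
    }

    // Placeholder for the existing document creation code.
    doc = doCreateDocument(node, nsMappings, indexFormatVersion);

    // Remember the Document for the remainder of this indexing transaction.
    lastCreatedDocs.put(uuid, new WeakReference<Document>(doc));
    return doc;
}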
Now, when you have aggregates, mergeAggregatedNodeIndexes(node, doc) will
get a cached Document back for every node it has to merge. This way,
within a *single* indexing transaction, each node is indexed by Lucene
once, and not, possibly, many times.
Obviously, there is one catch: on the next indexing transaction we need to
flush the lastCreatedDocs cache, as a node might have been changed. If we
add a method to the QueryHandler interface, something like

void flush()

and call it whenever we are about to index a new set of nodes, I think we
should be fine. AFAICS, if in MultiIndex

synchronized void update(Iterator remove, Iterator add) throws IOException {

we do a

handler.flush();

we are there.
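Concretely, the pieces would be something along these lines (a sketch
only, assuming MultiIndex keeps its SearchIndex reference in the handler
field used in the call above):

// QueryHandler: new method, invoked before a new set of nodes is indexed.
void flush();

// SearchIndex: drop all Documents cached during the previous transaction.
public void flush() {
    lastCreatedDocs.clear();
}

// MultiIndex.update(): clear the per-transaction cache up front.
synchronized void update(Iterator remove, Iterator add) throws IOException {
    handler.flush();
    // ... existing remove/add handling stays unchanged ...
}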
What do others think? If people like it, I'll create a patch to test.
Regards Ard