Hi,

In general, I think such a cache makes sense, but we have to be careful when
reusing Document instances. They may contain Readers that have state; once
consumed, such a Reader will probably be closed and throw an exception on the
next read.
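To illustrate the pitfall (a minimal sketch using plain java.io, no Lucene
involved; the class and variable names are made up):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderPitfall {
    public static void main(String[] args) throws IOException {
        // A Reader-backed field value, as used for extracted binary content.
        Reader text = new StringReader("extracted pdf text");

        // The first consumer (e.g. the first addDocument() call) drains
        // and closes the Reader.
        while (text.read() != -1) { /* consume */ }
        text.close();

        // A second consumer of the same cached Document instance then hits
        // the closed Reader:
        text.read(); // throws java.io.IOException: Stream closed
    }
}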
regards
marcel

On Wed, Sep 16, 2009 at 14:56, Ard Schrijvers <[email protected]> wrote:
> Hello,
>
> I want to avoid recreating a Lucene Document over and over when using
> aggregates. This particularly holds when using aggregates for binary
> data like a PDF. I think using aggregates for binary data is a good
> use case, but indexing and extracting a large PDF is CPU intensive,
> and this is currently done multiple times when using aggregates. The
> same holds for non-binary data, but the performance penalty is lower.
>
> I would like to add a really simple, small cache to the SearchIndex,
> something like:
>
> private final Map<String, WeakReference<Document>> lastCreatedDocs =
>     new LRUMap(500);
>
> (WeakReferences are used in case the Lucene Document is really large.)
>
> Now, in
>
> protected Document createDocument(NodeState node, NamespaceMappings
>     nsMappings, IndexFormatVersion indexFormatVersion)
>
> we start with:
>
> WeakReference<Document> refDoc =
>     lastCreatedDocs.get(node.getNodeId().getUUID().toString());
> Document doc;
> if (refDoc != null && (doc = refDoc.get()) != null) {
>     // We already created a Document for this NodeState within
>     // the current transaction. Return it directly.
>     return doc;
> }
>
> and before the return statement:
>
> lastCreatedDocs.put(node.getNodeId().getUUID().toString(),
>     new WeakReference<Document>(doc));
>
> Now, when you have aggregates, then in
>
> mergeAggregatedNodeIndexes(node, doc);
>
> you'll get a cached version of the node to merge back. This way,
> within a *single* indexing transaction, every node is indexed by
> Lucene once, not, possibly, many times.
>
> Obviously, there is one catch: on the next indexing transaction we
> need to flush the lastCreatedDocs cache, because a node might have
> changed. If we add a method to the QueryHandler interface, something
> like
>
> void flush()
>
> and call it whenever we are going to index a new set of nodes, I
> think we should be fine. AFAICS, if in MultiIndex
>
> synchronized void update(Iterator remove, Iterator add) throws IOException {
>
> we do a
>
> handler.flush()
>
> we are there.
>
> What do others think? If people like it, I'll create a patch to test.
>
> Regards Ard
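For reference, a consolidated sketch of the proposal above (hypothetical
code, not the actual Jackrabbit SearchIndex: a bounded LinkedHashMap stands
in for the raw-typed commons-collections LRUMap, and createDocumentUncached()
stands in for the existing, expensive document-creation logic):

import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.document.Document;

public class DocumentCacheSketch {

    private static final int CACHE_SIZE = 500;

    // WeakReferences so that a really large Document can still be
    // garbage-collected; the map itself is bounded, LRU-style.
    private final Map<String, WeakReference<Document>> lastCreatedDocs =
            Collections.synchronizedMap(
                    new LinkedHashMap<String, WeakReference<Document>>(16, 0.75f, true) {
                        @Override
                        protected boolean removeEldestEntry(
                                Map.Entry<String, WeakReference<Document>> eldest) {
                            return size() > CACHE_SIZE;
                        }
                    });

    protected Document createDocument(String uuid) {
        WeakReference<Document> refDoc = lastCreatedDocs.get(uuid);
        Document doc;
        if (refDoc != null && (doc = refDoc.get()) != null) {
            // Already created within the current indexing transaction,
            // e.g. because an aggregate root pulled this node in first.
            return doc;
        }
        doc = createDocumentUncached(uuid);
        lastCreatedDocs.put(uuid, new WeakReference<Document>(doc));
        return doc;
    }

    // To be called from MultiIndex.update() before each new set of nodes,
    // so stale Documents from the previous transaction are never reused.
    public void flush() {
        lastCreatedDocs.clear();
    }

    private Document createDocumentUncached(String uuid) {
        // Placeholder for the real (expensive) extraction and indexing work.
        return new Document();
    }
}

Note that, per the concern at the top of this thread, returning a cached
Document is only safe if its Reader-valued fields have not yet been consumed
by a previous addDocument() call.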
