Hi,

In general, I think such a cache makes sense, but we have to be careful when
reusing Document instances. They may contain Readers that have state; once
consumed, such a Reader will probably be closed and throw an exception on the
next read.
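To illustrate the pitfall (a minimal sketch using plain java.io, no Lucene
involved; the class and variable names are made up):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderPitfall {
    public static void main(String[] args) throws IOException {
        // A Reader-backed field value, as used for extracted binary content.
        Reader text = new StringReader("extracted pdf text");

        // The first consumer (e.g. the first addDocument() call) drains
        // and closes the Reader.
        while (text.read() != -1) { /* consume */ }
        text.close();

        // A second consumer of the same cached Document instance then hits
        // the closed Reader:
        text.read(); // throws java.io.IOException: Stream closed
    }
}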
regards
marcel

On Wed, Sep 16, 2009 at 14:56, Ard Schrijvers <[email protected]> wrote:
> Hello,
>
> I want to avoid recreating a Lucene Document over and over when using
> aggregates. This particularly holds when using aggregates for binary
> data like a PDF. I think using aggregates for binary data is a good
> use case, but indexing and extracting a large PDF is CPU intensive,
> and this is currently done multiple times when using aggregates. The
> same holds for non-binary data, but the performance penalty is lower.
>
> I would like to add a really simple, small cache to the SearchIndex,
> something like:
>
> private final Map<String, WeakReference<Document>> lastCreatedDocs =
>     new LRUMap(500);
>
> (WeakReferences are used in case the Lucene Document is really large.)
>
> Now, in
>
> protected Document createDocument(NodeState node, NamespaceMappings
>     nsMappings, IndexFormatVersion indexFormatVersion)
>
> we start with:
>
> WeakReference<Document> refDoc =
>     lastCreatedDocs.get(node.getNodeId().getUUID().toString());
> Document doc;
> if (refDoc != null && (doc = refDoc.get()) != null) {
>     // We already created a Document for this NodeState within
>     // the current transaction. Return it directly.
>     return doc;
> }
>
> and before the return statement:
>
> lastCreatedDocs.put(node.getNodeId().getUUID().toString(),
>     new WeakReference<Document>(doc));
>
> Now, when you have aggregates, then in
>
> mergeAggregatedNodeIndexes(node, doc);
>
> you'll get a cached version of the node to merge back. This way,
> within a *single* indexing transaction, every node is indexed by
> Lucene once, not, possibly, many times.
>
> Obviously, there is one catch: on the next indexing transaction we
> need to flush the lastCreatedDocs cache, because a node might have
> changed. If we add a method to the QueryHandler interface, something
> like
>
> void flush()
>
> and call it whenever we are going to index a new set of nodes, I
> think we should be fine. AFAICS, if in MultiIndex
>
> synchronized void update(Iterator remove, Iterator add) throws IOException {
>
> we do a
>
> handler.flush()
>
> we are there.
>
> What do others think? If people like it, I'll create a patch to test.
>
> Regards Ard
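For reference, a consolidated sketch of the proposal above (hypothetical
code, not the actual Jackrabbit SearchIndex: a bounded LinkedHashMap stands
in for the raw-typed commons-collections LRUMap, and createDocumentUncached()
stands in for the existing, expensive document-creation logic):

import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.document.Document;

public class DocumentCacheSketch {

    private static final int CACHE_SIZE = 500;

    // WeakReferences so that a really large Document can still be
    // garbage-collected; the map itself is bounded, LRU-style.
    private final Map<String, WeakReference<Document>> lastCreatedDocs =
            Collections.synchronizedMap(
                    new LinkedHashMap<String, WeakReference<Document>>(16, 0.75f, true) {
                        @Override
                        protected boolean removeEldestEntry(
                                Map.Entry<String, WeakReference<Document>> eldest) {
                            return size() > CACHE_SIZE;
                        }
                    });

    protected Document createDocument(String uuid) {
        WeakReference<Document> refDoc = lastCreatedDocs.get(uuid);
        Document doc;
        if (refDoc != null && (doc = refDoc.get()) != null) {
            // Already created within the current indexing transaction,
            // e.g. because an aggregate root pulled this node in first.
            return doc;
        }
        doc = createDocumentUncached(uuid);
        lastCreatedDocs.put(uuid, new WeakReference<Document>(doc));
        return doc;
    }

    // To be called from MultiIndex.update() before each new set of nodes,
    // so stale Documents from the previous transaction are never reused.
    public void flush() {
        lastCreatedDocs.clear();
    }

    private Document createDocumentUncached(String uuid) {
        // Placeholder for the real (expensive) extraction and indexing work.
        return new Document();
    }
}

Note that, per the concern at the top of this thread, returning a cached
Document is only safe if its Reader-valued fields have not yet been consumed
by a previous addDocument() call.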
