[ https://issues.apache.org/jira/browse/LUCENE-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-565:
--------------------------------------
Attachment: LUCENE-565.Feb2007.patch
OK I moved NewIndexModifier's methods into IndexWriter and did some
small refactoring, tightening up protections, fixed javadocs,
indentation, etc. NewIndexModifier is now removed.
I like this solution much better!
I also increased the default number of deleted terms before a flush is
triggered from 10 to 1000. These buffered terms use very little
memory so I think it makes sense to have a larger default?
So, this adds these public methods to IndexWriter:
public void updateDocument(Term term, Document doc, Analyzer analyzer)
public void updateDocument(Term term, Document doc)
public synchronized void deleteDocuments(Term[] terms)
public synchronized void deleteDocuments(Term term)
public void setMaxBufferedDeleteTerms(int maxBufferedDeleteTerms)
public int getMaxBufferedDeleteTerms()
And this public field:
public final static int DEFAULT_MAX_BUFFERED_DELETE_TERMS = 1000;
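To make the intent of updateDocument(Term, Document) concrete, here is a minimal in-memory sketch of "delete by term, then add" done under a single lock. It is only a toy model: UpdateSketch and Doc are hypothetical names, a plain String stands in for Term, and none of this is the code in the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory model of update-as-delete-then-add; NOT Lucene code. */
class UpdateSketch {
    /** Hypothetical stand-in for a Document with one key field. */
    static final class Doc {
        final String key;
        final String body;
        Doc(String key, String body) { this.key = key; this.body = body; }
    }

    private final List<Doc> docs = new ArrayList<>();

    synchronized void addDocument(Doc doc) {
        docs.add(doc);
    }

    /** Removes every document whose key equals the given term. */
    synchronized void deleteDocuments(String term) {
        docs.removeIf(d -> d.key.equals(term));
    }

    /** Delete and add under one monitor, so no other thread sees the gap. */
    synchronized void updateDocument(String term, Doc doc) {
        deleteDocuments(term); // reentrant: this thread already holds the lock
        addDocument(doc);
    }

    /** Bodies of the surviving documents, in insertion order. */
    synchronized List<String> bodies() {
        List<String> out = new ArrayList<>();
        for (Doc d : docs) out.add(d.body);
        return out;
    }

    public static void main(String[] args) {
        UpdateSketch w = new UpdateSketch();
        w.addDocument(new Doc("id:1", "v1"));
        w.addDocument(new Doc("id:2", "other"));
        w.updateDocument("id:1", new Doc("id:1", "v2"));
        System.out.println(w.bodies()); // prints [other, v2]
    }
}
```

The point of doing both steps inside one synchronized method is that a caller never has to issue a separate delete and add, and a concurrent reader of this toy model never observes the document as missing.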
On the extension points, we had previously added these 4:
protected void doAfterFlushRamSegments(boolean flushedRamSegments)
protected boolean timeToFlushRam()
protected boolean anythingToFlushRam()
protected boolean onlyRamDocsToFlush()
I would propose that instead we add only the first one above, but
rename it to "doAfterFlush()". This is basically a callback that a
subclass could use to do its own thing after a flush but before a
commit.
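As a rough illustration of where such a callback would sit, the sketch below uses a template-method style flush that calls a no-op doAfterFlush() between the flush step and the commit step. HookedWriterSketch and AuditingWriterSketch are hypothetical names and the "flush" and "commit" steps are placeholders; this is an assumption about the shape of the hook, not the patch itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy writer showing where a doAfterFlush() hook would fire; NOT Lucene code. */
class HookedWriterSketch {
    protected final List<String> events = new ArrayList<>();

    /** Flushes buffered state, runs the subclass hook, then commits. */
    public final synchronized void flush() {
        events.add("flush");   // placeholder for writing buffered docs and deletes
        doAfterFlush();        // extension point: after the flush, before the commit
        events.add("commit");  // placeholder for making the flushed state visible
    }

    /** Default is a no-op; a subclass overrides this to do its own work. */
    protected void doAfterFlush() { }

    synchronized List<String> events() {
        return new ArrayList<>(events);
    }
}

/** Example subclass that just records that the hook ran. */
class AuditingWriterSketch extends HookedWriterSketch {
    @Override
    protected void doAfterFlush() {
        events.add("doAfterFlush");
    }
}
```

Keeping flush() final and exposing only the single hook means a subclass can react to a flush without depending on how RAM buffering is implemented, which is the reason for dropping the other three callbacks.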
But then I don't think we should add any of the others. The
"timeToFlushRam()" callback isn't really needed now that we have a
public "flush()" method. And the other two are very specific to how
IndexWriter implements RAM buffering/flushing and so unless/until we
can think of a use case that needs these I'm inclined to not include
them?
Yonik, is there something in Solr that would need these last 2
callbacks?
I've attached the patch (LUCENE-565.Feb2007.patch) with these
changes!
> Supporting deleteDocuments in IndexWriter (Code and Performance Results
> Provided)
> ---------------------------------------------------------------------------------
>
> Key: LUCENE-565
> URL: https://issues.apache.org/jira/browse/LUCENE-565
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Ning Li
> Assigned To: Michael McCandless
> Fix For: 2.1
>
> Attachments: LUCENE-565.Feb2007.patch,
> NewIndexModifier.Jan2007.patch, NewIndexModifier.Jan2007.take2.patch,
> NewIndexModifier.Jan2007.take3.patch, NewIndexModifier.Sept21.patch,
> perf-test-res.JPG, perf-test-res2.JPG, perfres.log,
> TestBufferedDeletesPerf.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> -----------
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that with this change, it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --------------
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document is properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> -------------------
> To test the performance of our proposed changes, we ran some experiments
> using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz
> Intel Xeon server running Linux. The disk storage was configured as a RAID0
> array with 5 drives. Before indexes were built, the input documents were parsed
> to remove the HTML from them (i.e., only the text was indexed). This was
> done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
> - Insert only. 1.6M documents were inserted and the final
> index size was 2.3GB.
> - Insert/delete (big batches). The same documents were
> inserted, but 25% were deleted. 1000 documents were
> deleted for every 4000 inserted.
> - Insert/delete (small batches). In this case, 5 documents
> were deleted for every 20 inserted.
>                                  current        current          new
> Workload                         IndexWriter    IndexModifier    IndexWriter
> -----------------------------------------------------------------------
> Insert only                      116 min        119 min          116 min
> Insert/delete (big batches)      --             135 min          125 min
> Insert/delete (small batches)    --             338 min          134 min
> As the experiments show, with the proposed changes, indexing time dropped
> by about 60% (from 338 min to 134 min) when inserts and deletes were
> interleaved in small batches.
> Regards,
> Ning
> Ning Li
> Search Technologies
> IBM Almaden Research Center
> 650 Harry Road
> San Jose, CA 95120
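The buffering scheme described under "Coding Changes" in the quoted description can be sketched roughly as follows. This is a toy model under stated assumptions, not the attached code: BufferedDeleteSketch, Doc and the two list fields are hypothetical, a String id stands in for a Term, and each buffered delete remembers how many RAM documents existed when it arrived so that it only affects documents added before it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of buffered deletes applied at flush time; NOT Lucene code. */
class BufferedDeleteSketch {
    static final class Doc {
        final String id;
        final String body;
        Doc(String id, String body) { this.id = id; this.body = body; }
    }

    private final int maxBufferedDocs;
    private final int maxBufferedDeleteTerms;
    private final List<Doc> ramDocs = new ArrayList<>();  // stand-in for the ramDirectory
    private final List<Doc> diskDocs = new ArrayList<>(); // stand-in for on-disk segments
    // delete term -> number of RAM docs that were buffered when the delete arrived
    private final Map<String, Integer> bufferedDeletes = new HashMap<>();
    private int flushCount = 0;

    BufferedDeleteSketch(int maxBufferedDocs, int maxBufferedDeleteTerms) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.maxBufferedDeleteTerms = maxBufferedDeleteTerms;
    }

    synchronized void addDocument(Doc doc) {
        ramDocs.add(doc);
        maybeFlush();
    }

    /** Only buffers the term; nothing is deleted until the next flush. */
    synchronized void deleteDocuments(String idTerm) {
        bufferedDeletes.put(idTerm, ramDocs.size());
        maybeFlush();
    }

    /** Assumed trigger: flush when either buffer reaches its limit. */
    private void maybeFlush() {
        if (ramDocs.size() >= maxBufferedDocs
                || bufferedDeletes.size() >= maxBufferedDeleteTerms) {
            flush();
        }
    }

    synchronized void flush() {
        if (ramDocs.isEmpty() && bufferedDeletes.isEmpty()) return;
        // Every doc already on disk predates every buffered delete.
        diskDocs.removeIf(d -> bufferedDeletes.containsKey(d.id));
        // A RAM doc survives unless a matching delete arrived after it was added.
        for (int i = 0; i < ramDocs.size(); i++) {
            Doc d = ramDocs.get(i);
            Integer limit = bufferedDeletes.get(d.id);
            if (limit == null || i >= limit) diskDocs.add(d);
        }
        ramDocs.clear();
        bufferedDeletes.clear();
        flushCount++;
    }

    synchronized void close() { flush(); }

    synchronized int flushCount() { return flushCount; }

    synchronized List<String> diskBodies() {
        List<String> out = new ArrayList<>();
        for (Doc d : diskDocs) out.add(d.body);
        return out;
    }
}
```

With this bookkeeping, the sequence add(id 1), delete(id 1), add(id 1) leaves only the second copy after a flush, and many small interleaved batches still produce one flush per full buffer rather than one per batch.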