[ 
http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12457833 ] 
            
Yonik Seeley commented on LUCENE-565:
-------------------------------------

On 12/12/06, Ning Li <[EMAIL PROTECTED]> wrote:
> > To minimize the number of reader open/closes on large persistent segments, 
> > I think the ability to apply deletes only before a merge is important.  
> > That might add a 4th method: doBeforeMerge()
> 
> I'm not sure I get this. Buffered deletes are only applied (flushed)
> during ram flush. No buffered deletes are applied in the merges of
> on-disk segments.

What is important is to be able to apply deletes before any ids change.
You could do it after every new lowest-level segment is written to the index 
(the flush), *or* you could choose to do it before a merge of the lowest level 
on-disk segments.  If none of the lowest level segments have deletes, you could 
even defer the deletes until after all the lowest-level segments have been 
merged.  This makes the deletes more efficient, since the cost per deleted 
term goes from O(mergeFactor * log(maxBufferedDocs)) to 
O(log(mergeFactor * maxBufferedDocs)): e.g. with mergeFactor=10 and 
maxBufferedDocs=1000, roughly 10 * log(1000) of lookup work per term drops 
to log(10000).

If we can't reuse IndexReaders, this becomes more important.

One could perhaps choose to defer deletes until a segment with deleted docs is 
involved in a merge.
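
To make the deferral concrete, here is a minimal sketch of applying a
batch of buffered delete terms in one pass, before any merge renumbers
ids.  The real change would work per-segment inside IndexWriter; this
version goes through the public IndexReader API, and the
DeleteFlusher/applyBufferedDeletes names are made up:

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.Set;

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.store.Directory;

  class DeleteFlusher {
    // One reader open/close per batch of deletes, rather than one per
    // newly flushed segment.
    static int applyBufferedDeletes(Directory dir, Set bufferedTerms)
        throws IOException {
      IndexReader reader = IndexReader.open(dir);
      int deleted = 0;
      try {
        for (Iterator it = bufferedTerms.iterator(); it.hasNext();) {
          deleted += reader.deleteDocuments((Term) it.next());
        }
      } finally {
        reader.close();  // commits the deletes
      }
      return deleted;
    }
  }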
 
> > It would be nice to not have to continually open and close readers on 
> > segments that aren't involved in a merge.  Is there a way to do this?
> > If SegmentInfos had a cached reader, that seems like it would solve both 
> > problems.
> > I haven't thought about it enough to figure out how doable it is though.
> 
> This is a good idea! One concern, however, is that caching readers
> will cause a larger memory footprint. Is it acceptable?

As I said, I haven't had time to think about it in depth, but at the lowest 
level of reuse, it wouldn't increase the footprint at all in the case where 
deletes are deferred until a merge:

The specific scenario I'm thinking of is this: instead of
  doAfterFlushRamSegments()
    open readers
    delete docs
    close readers
  segmentMerger()
    open readers
    merge segments
    close readers

It would be:
  doAfterFlushRamSegments()
    open readers
    delete docs
  segmentMerger()
    merge segments
    close readers

This cuts out an additional open/close cycle.
You are right that other forms of reader caching could increase the footprint, 
but it's nice to have the option of trading some memory for performance.
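
For illustration, here is that reuse pattern end-to-end through the
public API.  The real change would reuse IndexWriter's own
SegmentReaders internally; addIndexes(IndexReader[]) stands in for
segmentMerger() here, and DeleteThenMerge is a made-up name:

  import java.io.IOException;

  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.store.Directory;

  class DeleteThenMerge {
    static void deleteThenMerge(Directory dest, Directory[] srcs,
        Term[] deletes) throws IOException {
      IndexReader[] readers = new IndexReader[srcs.length];
      for (int i = 0; i < srcs.length; i++) {
        readers[i] = IndexReader.open(srcs[i]);      // open readers
      }
      try {
        for (int i = 0; i < readers.length; i++) {
          for (int j = 0; j < deletes.length; j++) {
            readers[i].deleteDocuments(deletes[j]);  // delete docs
          }
        }
        IndexWriter writer =
            new IndexWriter(dest, new WhitespaceAnalyzer(), true);
        try {
          writer.addIndexes(readers);  // merge skips the deleted docs
        } finally {
          writer.close();
        }
      } finally {
        for (int i = 0; i < readers.length; i++) {
          readers[i].close();  // close readers once, after the merge
        }
      }
    }
  }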

Yet another strategy a subclass of IndexWriter could choose is to only apply 
deletes to segments actually involved in a merge.  Then the bigger segments in 
the index wouldn't continually have a reader opened and closed on them... it 
could all be deferred until a close, or until there are too many deletes 
buffered.

Of course NewIndexModifier doesn't have to implement all these options to start 
with, but it would be nice if the extension hooks in IndexWriter could support 
them.
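
For concreteness, the hooks might look something like this (the names
follow this thread, but the signatures are guesses, not a committed
API):

  import java.io.IOException;

  public abstract class HookedIndexWriter /* extends IndexWriter */ {
    // Called after buffered RAM segments are written to disk;
    // NewIndexModifier could apply buffered deletes here.
    protected void doAfterFlushRamSegments() throws IOException {}

    // Called before on-disk segments are merged, while doc ids are
    // still stable; a subclass could apply deletes here instead, or
    // only when one of the segments already has deletions.
    protected void doBeforeMerge() throws IOException {}
  }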

Whew, this is why I was slow to get involved in this again :-)

> Supporting deleteDocuments in IndexWriter (Code and Performance Results 
> Provided)
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-565
>                 URL: http://issues.apache.org/jira/browse/LUCENE-565
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Ning Li
>         Attachments: IndexWriter.java, IndexWriter.July09.patch, 
> IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, 
> NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, 
> NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, 
> newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, 
> perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> -----------
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change, it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
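>
> For illustration, a sketch of interleaving through a single writer
> with the proposed method (the "id" field and value are made up):
>
>   import org.apache.lucene.analysis.WhitespaceAnalyzer;
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.document.Field;
>   import org.apache.lucene.index.IndexWriter;
>   import org.apache.lucene.index.Term;
>   import org.apache.lucene.store.RAMDirectory;
>
>   public class InterleavedUpdates {
>     public static void main(String[] args) throws Exception {
>       RAMDirectory dir = new RAMDirectory();
>       IndexWriter writer =
>           new IndexWriter(dir, new WhitespaceAnalyzer(), true);
>       Document doc = new Document();
>       doc.add(new Field("id", "42",
>           Field.Store.YES, Field.Index.UN_TOKENIZED));
>       writer.addDocument(doc);                       // insert
>       writer.deleteDocuments(new Term("id", "42"));  // buffered delete
>       writer.close();  // flush applies the buffered delete
>     }
>   }
>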
> Coding Changes
> --------------
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
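>
> One way to get that serialization (a hypothetical sketch, not
> necessarily the attached patch's exact bookkeeping): each buffered
> delete term records how many documents were buffered when it arrived,
> so it only applies to documents inserted before it.
>
>   import java.util.HashMap;
>   import java.util.Map;
>   import org.apache.lucene.index.Term;
>
>   class BufferedDeletes {
>     private final Map terms = new HashMap();  // Term -> Integer
>     private int numBufferedDocs = 0;
>
>     synchronized void noteAdd() { numBufferedDocs++; }
>
>     // An insert of the same document after this point is unaffected.
>     synchronized void noteDelete(Term term) {
>       terms.put(term, new Integer(numBufferedDocs));
>     }
>   }
>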
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> -------------------
> To test the performance of our proposed changes, we ran some experiments using
> the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel
> Xeon server running Linux. The disk storage was configured as a RAID0 array
> with 5 drives. Before indexes were built, the input documents were parsed
> to remove the HTML from them (i.e., only the text was indexed). This was
> done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
>     index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
>     inserted, but 25% were deleted. 1000 documents were
>     deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
>     were deleted for every 20 inserted.
>                                 current       current          new
> Workload                      IndexWriter  IndexModifier   IndexWriter
> -----------------------------------------------------------------------
> Insert only                     116 min       119 min        116 min
> Insert/delete (big batches)       --          135 min        125 min
> Insert/delete (small batches)     --          338 min        134 min
> As the experiments show, with the proposed changes, the performance
> improved by 60% when inserts and deletes were interleaved in small batches.
> Regards,
> Ning
> Ning Li
> Search Technologies
> IBM Almaden Research Center
> 650 Harry Road
> San Jose, CA 95120

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
