[ https://issues.apache.org/jira/browse/LUCENE-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-6161:
---------------------------------------
    Attachment: LUCENE-6161.patch

Here's a hackish start at a prototype patch.  Really, this is the same
problem as intersecting doc-id lists, but with sorted terms instead.

The patch pre-builds FSTs for the deleted terms to be intersected, and
then picks one of three methods (drive by the segment's terms, drive by
the deleted terms, or a simple merge sort) to do the intersection.
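
For reference, a bare sketch of the merge-sort case (an illustrative
harness, not the patch itself; the class and method names are made up):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefIterator;

class MergeSortIntersectSketch {
  // Walk the segment's sorted TermsEnum and the sorted deleted terms in
  // lockstep, advancing whichever is smaller; equal terms are deletes to apply.
  static void intersect(TermsEnum segTerms, BytesRefIterator deleted) throws IOException {
    BytesRef s = segTerms.next();
    BytesRef d = deleted.next();
    while (s != null && d != null) {
      int cmp = s.compareTo(d);
      if (cmp < 0) {
        s = segTerms.next();    // segment term too small: advance segment side
      } else if (cmp > 0) {
        d = deleted.next();     // deleted term too small: advance deleted side
      } else {
        // match: enumerate postings for s and mark those docs deleted ...
        s = segTerms.next();
        d = deleted.next();
      }
    }
  }
}
{code}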

Really, what would be best is if we could pass the FST to
Terms.intersect (it only takes an automaton today...).  I could try
building an automaton instead, but that may be too RAM heavy....
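
The automaton path would look roughly like this (a sketch: it assumes
the deleted terms are already sorted, which Automata.makeStringUnion
requires, and the surrounding method is made up):

{code:java}
import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.Automata;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CompiledAutomaton;

class AutomatonIntersectSketch {
  // Union the sorted deleted terms into one Automaton (the potentially
  // RAM-heavy step), then let the segment drive the intersection:
  static TermsEnum intersectDeleted(Terms segTerms, Collection<BytesRef> sortedDeleted)
      throws IOException {
    Automaton a = Automata.makeStringUnion(sortedDeleted);
    return segTerms.intersect(new CompiledAutomaton(a), null);
  }
}
{code}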

Tests pass, so I think it's correct, but the code is ugly/prototype.
The logic for when to pre-build the FST needs improving...

I ran a test, indexing the first 10M Wikipedia docs using
updateDocument over a random ID space of 100M: 8 indexing threads, CMS,
32 MB indexing buffer (frequent small segments written, stressing
delete packets).
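
The test is basically this shape (a sketch, not the actual benchmark
driver: single-threaded here for brevity, and the real run indexed
Wikipedia docs with 8 threads):

{code:java}
import java.nio.file.Paths;
import java.util.Random;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class StressDeletes {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setRAMBufferSizeMB(32);                      // small buffer: frequent segment flushes
    iwc.setMergeScheduler(new ConcurrentMergeScheduler());
    try (IndexWriter w = new IndexWriter(FSDirectory.open(Paths.get("index")), iwc)) {
      Random random = new Random();
      for (int i = 0; i < 10000000; i++) {
        Document doc = new Document();
        String id = Integer.toString(random.nextInt(100000000));
        doc.add(new StringField("id", id, Field.Store.NO));
        // every add is an update: buffers a delete term against the old doc
        w.updateDocument(new Term("id", id), doc);
      }
    }
  }
}
{code}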

Trunk applyDeletes took an aggregate 159 sec (vs 117 sec with the
patch), and overall indexing time for trunk was 310 sec (vs 290 sec
with the patch).

However, with a 350 MB indexing RAM buffer, applyDeletes took an
aggregate 129 sec (vs 94 sec with the patch), yet overall indexing time
was 263 sec (vs 272 sec with the patch: slower).

I may need to index with a single thread and SMS to reduce the
noise... but that is slow.


> Applying deletes is sometimes dog slow
> --------------------------------------
>
>                 Key: LUCENE-6161
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6161
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-6161.patch
>
>
> I hit this while testing various use cases for LUCENE-6119 (adding 
> auto-throttle to ConcurrentMergeScheduler).
> When I tested "always call updateDocument" (each add buffers a delete term), 
> with many indexing threads, opening an NRT reader once per second (forcing 
> all deleted terms to be applied), I see that 
> BufferedUpdatesStream.applyDeletes sometimes seems to take a loooong time, 
> e.g.:
> {noformat}
> BD 0 [2015-01-04 09:31:12.597; Lucene Merge Thread #69]: applyDeletes took 339 msec for 10 segments, 117 deleted docs, 607333 visited terms
> BD 0 [2015-01-04 09:31:18.148; Thread-4]: applyDeletes took 5533 msec for 62 segments, 10989 deleted docs, 8517225 visited terms
> BD 0 [2015-01-04 09:31:21.463; Lucene Merge Thread #71]: applyDeletes took 1065 msec for 10 segments, 470 deleted docs, 1825649 visited terms
> BD 0 [2015-01-04 09:31:26.301; Thread-5]: applyDeletes took 4835 msec for 61 segments, 14676 deleted docs, 9649860 visited terms
> BD 0 [2015-01-04 09:31:35.572; Thread-11]: applyDeletes took 6073 msec for 72 segments, 13835 deleted docs, 11865319 visited terms
> BD 0 [2015-01-04 09:31:37.604; Lucene Merge Thread #75]: applyDeletes took 251 msec for 10 segments, 58 deleted docs, 240721 visited terms
> BD 0 [2015-01-04 09:31:44.641; Thread-11]: applyDeletes took 5956 msec for 64 segments, 15109 deleted docs, 10599034 visited terms
> BD 0 [2015-01-04 09:31:47.814; Lucene Merge Thread #77]: applyDeletes took 396 msec for 10 segments, 137 deleted docs, 719914 visited terms
> {noformat}
> What this means is that even though I want an NRT reader every second, I 
> often don't get one for ~7 seconds or more.
> This is on an SSD; the machine has 48 GB RAM, the heap size is only 2 GB, 
> and there are 12 indexing threads.
> As hideously complex as this code is, I think there are some inefficiencies, 
> but fixing them could be hard / make the code even hairier ...
> Also, this code is mega-locked: it holds IW's lock and BD's lock, so it 
> blocks things like merges kicking off or finishing...
> E.g., we pull the MergedIterator many times on the same set of sub-iterators. 
>  Maybe we can create the sorted terms up front and reuse that?
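> A rough sketch of building the merged, deduped term list once with a 
> priority queue (plain java.util types; the names are illustrative, not from 
> the patch, and it assumes each sub-iterator returns fresh BytesRef instances):
> {code:java}
> import java.util.ArrayList;
> import java.util.Iterator;
> import java.util.List;
> import java.util.PriorityQueue;
> import org.apache.lucene.util.BytesRef;
> 
> class MergeOnceSketch {
>   // Heap entry: one sub-iterator plus its current (top) term.
>   static final class Sub implements Comparable<Sub> {
>     final Iterator<BytesRef> it;
>     BytesRef top;
>     Sub(Iterator<BytesRef> it) { this.it = it; this.top = it.next(); }
>     @Override public int compareTo(Sub other) { return top.compareTo(other.top); }
>   }
> 
>   // Merge the sorted per-packet sub-iterators once, up front; the result is
>   // then reused for every segment instead of re-pulling MergedIterator.
>   static List<BytesRef> mergeOnce(List<Iterator<BytesRef>> subs) {
>     PriorityQueue<Sub> pq = new PriorityQueue<Sub>();
>     for (Iterator<BytesRef> it : subs) {
>       if (it.hasNext()) {
>         pq.add(new Sub(it));
>       }
>     }
>     List<BytesRef> merged = new ArrayList<BytesRef>();
>     BytesRef last = null;
>     while (!pq.isEmpty()) {
>       Sub s = pq.poll();
>       if (last == null || s.top.equals(last) == false) {
>         merged.add(s.top);        // dedup, like removeDuplicates=true
>         last = s.top;
>       }
>       if (s.it.hasNext()) {
>         s.top = s.it.next();
>         pq.add(s);
>       }
>     }
>     return merged;                // built once, reused across all segments
>   }
> }
> {code}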
> Maybe we should go "term stride" (one term visits all N segments) instead of 
> "segment stride" (visit each segment, iterating all deleted terms for it).  
> Just iterating the terms to be deleted takes a sizable part of the time, and 
> we now do that once for every segment in the index.
> Also, the "isUnique" bit from LUCENE-6005 should help here: if we know the 
> field is unique, we can stop calling seekExact once we find a segment that 
> has the deleted term, and we can maybe pass false for removeDuplicates to 
> MergedIterator...  A combined sketch of both ideas is below:
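> (Rough sketch only; SegmentState and the other names here are made up for 
> illustration, they are not from the patch:)
> {code:java}
> import java.io.IOException;
> import java.util.List;
> import org.apache.lucene.index.TermsEnum;
> import org.apache.lucene.util.BytesRef;
> 
> class TermStrideSketch {
>   // Per-segment holder: just the TermsEnum over the deleted field.
>   static final class SegmentState {
>     TermsEnum termsEnum;
>   }
> 
>   // One pass over the merged, deduped deleted terms; each term probes all
>   // segments, instead of each segment re-iterating all deleted terms.
>   static void applyTermStride(List<BytesRef> mergedDeletedTerms,
>                               List<SegmentState> segStates,
>                               boolean fieldIsUnique) throws IOException {
>     for (BytesRef delTerm : mergedDeletedTerms) {
>       for (SegmentState seg : segStates) {
>         if (seg.termsEnum.seekExact(delTerm)) {
>           // enumerate postings for delTerm in seg, mark those docs deleted ...
>           if (fieldIsUnique) {
>             break;  // unique field: no other segment can contain this term
>           }
>         }
>       }
>     }
>   }
> }
> {code}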


