[
https://issues.apache.org/jira/browse/LUCENE-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-6161:
---------------------------------------
Attachment: LUCENE-6161.patch
Another patch, this one using DaciukMihovAutomatonBuilder to create an
automaton from the terms to delete, and then using Terms.intersect.
This one spends even less time applying deletes (46 sec vs 129 sec on trunk),
yet overall indexing time is still a bit slower (272 sec vs 263 sec on trunk).
I also fixed Automaton to implement Accountable ...
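
To illustrate the idea (this is a stdlib-only model, not Lucene's actual API): instead of doing a separate seekExact per deleted term in every segment, the patch builds one sorted structure from the delete terms (via DaciukMihovAutomatonBuilder) and intersects it with each segment's sorted term dictionary in a single forward pass (via Terms.intersect). A minimal sketch of that merge-style intersection over sorted strings:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Stdlib-only model: intersect a sorted set of delete terms with a
// segment's sorted term dictionary in one forward pass, rather than
// seeking each delete term from scratch. The class and method names here
// are hypothetical; Lucene's real code works on automata and TermsEnums.
public class SortedIntersect {
    static List<String> intersect(TreeSet<String> deletes, List<String> segmentTerms) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        for (String del : deletes) {                  // deletes arrive in sorted order
            while (i < segmentTerms.size() && segmentTerms.get(i).compareTo(del) < 0) {
                i++;                                   // advance the segment cursor, never rewind
            }
            if (i < segmentTerms.size() && segmentTerms.get(i).equals(del)) {
                hits.add(del);                         // this segment contains the deleted term
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> deletes = new TreeSet<>(List.of("b", "d", "x"));
        List<String> segmentTerms = List.of("a", "b", "c", "d", "e"); // sorted term dict
        System.out.println(intersect(deletes, segmentTerms));          // [b, d]
    }
}
```

Because both sides are sorted, the cursor only moves forward, so each segment's term dictionary is scanned at most once regardless of how many delete terms there are.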
> Applying deletes is sometimes dog slow
> --------------------------------------
>
> Key: LUCENE-6161
> URL: https://issues.apache.org/jira/browse/LUCENE-6161
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Fix For: 5.0, Trunk
>
> Attachments: LUCENE-6161.patch, LUCENE-6161.patch
>
>
> I hit this while testing various use cases for LUCENE-6119 (adding
> auto-throttle to ConcurrentMergeScheduler).
> When I tested "always call updateDocument" (each add buffers a delete term),
> with many indexing threads, opening an NRT reader once per second (forcing
> all deleted terms to be applied), I see that
> BufferedUpdatesStream.applyDeletes sometimes seems to take a loooong time,
> e.g.:
> {noformat}
> BD 0 [2015-01-04 09:31:12.597; Lucene Merge Thread #69]: applyDeletes took
> 339 msec for 10 segments, 117 deleted docs, 607333 visited terms
> BD 0 [2015-01-04 09:31:18.148; Thread-4]: applyDeletes took 5533 msec for 62
> segments, 10989 deleted docs, 8517225 visited terms
> BD 0 [2015-01-04 09:31:21.463; Lucene Merge Thread #71]: applyDeletes took
> 1065 msec for 10 segments, 470 deleted docs, 1825649 visited terms
> BD 0 [2015-01-04 09:31:26.301; Thread-5]: applyDeletes took 4835 msec for 61
> segments, 14676 deleted docs, 9649860 visited terms
> BD 0 [2015-01-04 09:31:35.572; Thread-11]: applyDeletes took 6073 msec for 72
> segments, 13835 deleted docs, 11865319 visited terms
> BD 0 [2015-01-04 09:31:37.604; Lucene Merge Thread #75]: applyDeletes took
> 251 msec for 10 segments, 58 deleted docs, 240721 visited terms
> BD 0 [2015-01-04 09:31:44.641; Thread-11]: applyDeletes took 5956 msec for 64
> segments, 15109 deleted docs, 10599034 visited terms
> BD 0 [2015-01-04 09:31:47.814; Lucene Merge Thread #77]: applyDeletes took
> 396 msec for 10 segments, 137 deleted docs, 719914 visit
> {noformat}
> What this means is that even though I want an NRT reader every second, I
> often don't get one for ~7 seconds or more.
> This is on an SSD; the machine has 48 GB RAM, the heap size is only 2 GB,
> and there are 12 indexing threads.
> As hideously complex as this code is, I think there are some inefficiencies,
> but fixing them could be hard / make the code even hairier ...
> Also, this code is mega-locked: holds IW's lock, holds BD's lock. It blocks
> things like merges kicking off or finishing...
> E.g., we pull the MergedIterator many times on the same set of sub-iterators.
> Maybe we can create the sorted terms up front and reuse that?
> Maybe we should go "term stride" (one term visits all N segments) not
> "segment stride" (visit each segment, iterating all deleted terms for it).
> Just iterating the terms to be deleted takes a sizable part of the time, and
> we now do that once for every segment in the index.
> Also, the "isUnique" bit in LUCENE-6005 should help here: if we know the
> field is unique, we can stop calling seekExact once we find a segment that
> has the deleted term, and we can maybe pass false for removeDuplicates to
> MergedIterator...
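
The "create the sorted terms up front" suggestion can be sketched with the stdlib (names here are hypothetical; Lucene's MergedIterator does this merge lazily on each pull): k-way merge the sorted per-thread delete-term iterators into one de-duplicated list once, then reuse that list for every segment instead of re-merging N times.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hedged stdlib sketch: merge k sorted sub-lists once, dropping duplicates
// (what removeDuplicates does in the real MergedIterator), so the merged
// result can be reused for every segment.
public class MergeOnce {
    static List<String> mergeSortedDedup(List<List<String>> sortedSubLists) {
        // Each heap entry is {listIndex, positionInList}, ordered by current term.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            (a, b) -> sortedSubLists.get(a[0]).get(a[1])
                        .compareTo(sortedSubLists.get(b[0]).get(b[1])));
        for (int i = 0; i < sortedSubLists.size(); i++) {
            if (!sortedSubLists.get(i).isEmpty()) pq.add(new int[] {i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            String term = sortedSubLists.get(top[0]).get(top[1]);
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(term)) {
                merged.add(term);                      // skip duplicates across sub-lists
            }
            if (top[1] + 1 < sortedSubLists.get(top[0]).size()) {
                pq.add(new int[] {top[0], top[1] + 1}); // advance that sub-list
            }
        }
        return merged;                                 // computed once, reused per segment
    }

    public static void main(String[] args) {
        List<List<String>> subs = List.of(List.of("a", "c"), List.of("b", "c", "d"));
        System.out.println(mergeSortedDedup(subs));    // [a, b, c, d]
    }
}
```

Paying the O(T log k) merge cost once, instead of once per segment, is exactly the saving the comment is pointing at.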
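
The "term stride" plus "isUnique" ideas combine naturally; here is a hypothetical stdlib-only sketch (segment term dictionaries modeled as maps from term to deleted-doc count, nothing here is Lucene's real API): iterate the sorted delete terms once, visiting all N segments per term, and if the field is known unique, stop at the first segment that contains the term.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "term stride": one pass over the delete terms,
// visiting all segments per term, instead of re-iterating the whole
// delete set once per segment. With a unique field (the LUCENE-6005
// "isUnique" bit), at most one segment can contain the term, so we can
// break out of the segment loop early.
public class TermStride {
    static int applyDeletes(List<Map<String, Integer>> segments,
                            Iterable<String> sortedDeleteTerms,
                            boolean fieldIsUnique) {
        int deletedDocs = 0;
        for (String term : sortedDeleteTerms) {        // single pass over the terms
            for (Map<String, Integer> segment : segments) {
                Integer count = segment.get(term);     // models TermsEnum.seekExact
                if (count != null) {
                    deletedDocs += count;
                    if (fieldIsUnique) break;          // no other segment can match
                }
            }
        }
        return deletedDocs;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> segments = List.of(
            new TreeMap<>(Map.of("id1", 1)),
            new TreeMap<>(Map.of("id2", 1, "id3", 1)));
        System.out.println(applyDeletes(segments, List.of("id1", "id3"), true)); // 2
    }
}
```

With the unique-field early exit, each delete term costs one hit plus the misses before it, rather than always touching all N segments.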
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]