[ https://issues.apache.org/jira/browse/LUCENE-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dawid Weiss updated LUCENE-8580:
--------------------------------
Attachment: LUCENE-8580.patch
> Make segment merging parallel in SegmentMerger
> ----------------------------------------------
>
> Key: LUCENE-8580
> URL: https://issues.apache.org/jira/browse/LUCENE-8580
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Attachments: LUCENE-8580.patch
>
>
> A placeholder issue stemming from the discussion on the mailing list [1]. Not
> of high priority.
> At the moment, merging N segments into one happens sequentially, one data
> structure at a time (postings, norms, points, etc.). If the input segments
> are large, the CPU (and I/O) remain mostly idle and the process takes a long
> time.
> These data structures are merged largely independently of one another, so it
> would be interesting to see whether we can speed things up by letting them
> run concurrently. I investigated this on a 40GB index with 22 segments,
> force-merging it into a single segment (of similar size). A quick-and-dirty
> patch is attached.
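> To illustrate the general shape of the idea (a minimal sketch, not the
> attached patch; the method names are placeholders for the corresponding
> SegmentMerger steps), the independent per-structure merges can be submitted
> as concurrent tasks and joined at the end:
> {code}
> import java.util.concurrent.CompletableFuture;
>
> final class ParallelSegmentMergeSketch {
>   void merge() {
>     // Each per-structure merge runs as an independent task on
>     // ForkJoinPool.commonPool(), the default executor of runAsync().
>     CompletableFuture.allOf(
>             CompletableFuture.runAsync(this::mergeStoredFields),
>             CompletableFuture.runAsync(this::mergeNorms),
>             CompletableFuture.runAsync(this::mergePostings),
>             CompletableFuture.runAsync(this::mergeDocValues),
>             CompletableFuture.runAsync(this::mergePoints))
>         // Wait for all of them; wall time is bounded by the slowest step.
>         .join();
>   }
>
>   // Placeholders for the existing, sequential merge routines.
>   private void mergeStoredFields() { /* ... */ }
>   private void mergeNorms() { /* ... */ }
>   private void mergePostings() { /* ... */ }
>   private void mergeDocValues() { /* ... */ }
>   private void mergePoints() { /* ... */ }
> }
> {code}
> Using the common pool is consistent with the ForkJoinPool.commonPool-worker-*
> thread names visible in the "after" log output below.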
> I see some improvement, although not by much; the largest component
> (postings) dominates everything else.
> Results below are from an 8-core CPU.
> Before:
> {code}
> SM 0 [2018-11-30T09:21:11.662Z; main]: 347237 msec to merge stored fields [41922110 docs]
> SM 0 [2018-11-30T09:21:18.236Z; main]: 6562 msec to merge norms [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 755507 msec to merge postings [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge doc values [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge points [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 7 msec to write field infos [41922110 docs]
> IW 0 [2018-11-30T09:33:56.124Z; main]: merge time 1112238 msec for 41922110 docs
> {code}
> After:
> {code}
> SM 0 [2018-11-30T10:16:42.179Z; ForkJoinPool.commonPool-worker-1]: 8189 msec to merge norms
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to merge doc values
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to merge points
> SM 0 [2018-11-30T10:16:42.211Z; ForkJoinPool.commonPool-worker-1]: merge store matchedCount=22 vs 22
> SM 0 [2018-11-30T10:23:24.574Z; ForkJoinPool.commonPool-worker-1]: 402381 msec to merge stored fields [41922110 docs]
> SM 0 [2018-11-30T10:32:20.862Z; ForkJoinPool.commonPool-worker-2]: 938668 msec to merge postings
> IW 0 [2018-11-30T10:32:23.513Z; main]: merge time 950249 msec for 41922110 docs
> {code}
> Ideally, one would push fork-join parallelism down into the individual
> subroutines so that, for example, the postings merge itself uses concurrency
> (pulling blocks of terms from the input concurrently, computing statistics,
> etc., and then pushing them to the codec in order).
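> As a rough sketch of what that could look like for postings (all names here
> are illustrative, not existing Lucene APIs): blocks of terms are merged
> concurrently while their results are handed to the codec in the original
> order:
> {code}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Future;
>
> final class OrderedPostingsMergeSketch {
>   static final class TermBlock {}
>   static final class MergedBlock {}
>
>   void mergeTermBlocks(List<TermBlock> blocks, ExecutorService pool) throws Exception {
>     List<Future<MergedBlock>> pending = new ArrayList<>();
>     for (TermBlock block : blocks) {
>       // CPU-heavy work (decoding postings, merging, computing term and doc
>       // statistics) fans out across threads, one task per block.
>       pending.add(pool.submit(() -> mergeOneBlock(block)));
>     }
>     for (Future<MergedBlock> merged : pending) {
>       // Results are consumed in submission order, so the codec still
>       // receives terms in sorted order.
>       writeToCodec(merged.get());
>     }
>   }
>
>   private MergedBlock mergeOneBlock(TermBlock block) { return new MergedBlock(); }
>   private void writeToCodec(MergedBlock merged) { /* sequential, ordered write */ }
> }
> {code}
> The ordered hand-off keeps the codec contract intact while the expensive
> per-block work runs in parallel.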
> [1] https://markmail.org/thread/dtejwq42qagykeac