[
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-4752:
---------------------------------
Attachment: sorting_10M_ingestion.log
natural_10M_ingestion.log
LUCENE-4752.patch
bq. Maybe just put a comment in IW where it calls merge.getReaders() why we
don't access the readers list directly
Done.
bq. I started working on this (LUCENE-4830 for memory and LUCENE-4839 for
complexity) and will run some indexing benchmarks with the Wikipedia corpus to
see how it behaves compared to natural merging.
Now that SortingAtomicReader uses TimSort to compute the doc ID mapping and
sort postings lists, using SortingMergePolicy only increases the merge
complexity by constant factors compared to a natural merge if the readers to
merge are sorted (I'm assuming the number of segments to merge is bounded). I
think this makes online sorting a viable option.
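To illustrate the doc ID mapping mentioned above, here is a minimal sketch in plain Java. It is not the SortingAtomicReader code: the class DocMapSketch and its methods are hypothetical names, and it relies on Arrays.sort for object arrays (a stable TimSort in OpenJDK) rather than Lucene's own sorter. It builds a new-to-old mapping from a per-document sort value, inverts it into old-to-new, and counts comparisons so that input made of a few already-sorted runs can be compared with unsorted input.

```java
import java.util.Arrays;
import java.util.Random;

public class DocMapSketch {
    /** Comparisons made by the most recent call to newToOld (illustration only). */
    static long comparisons;

    /** Returns newToOld: entry i is the old doc ID that receives new doc ID i. */
    static int[] newToOld(long[] sortValues) {
        comparisons = 0;
        Integer[] docs = new Integer[sortValues.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        // Arrays.sort on an object array is a stable TimSort in OpenJDK,
        // so docs with equal values keep their original relative order.
        Arrays.sort(docs, (a, b) -> {
            comparisons++;
            return Long.compare(sortValues[a], sortValues[b]);
        });
        int[] result = new int[docs.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = docs[i];
        }
        return result;
    }

    /** Inverts newToOld so that entry d is the new doc ID of old doc d. */
    static int[] oldToNew(int[] newToOld) {
        int[] inverse = new int[newToOld.length];
        for (int newDoc = 0; newDoc < newToOld.length; newDoc++) {
            inverse[newToOld[newDoc]] = newDoc;
        }
        return inverse;
    }

    public static void main(String[] args) {
        long[] values = {30, 10, 20, 10};
        int[] map = newToOld(values);
        System.out.println(Arrays.toString(map));           // [1, 3, 2, 0]
        System.out.println(Arrays.toString(oldToNew(map))); // [3, 0, 2, 1]

        // Four already-sorted "segments" of 1000 docs each, concatenated.
        Random random = new Random(42);
        long[] runs = new long[4000];
        for (int start = 0; start < runs.length; start += 1000) {
            for (int i = start; i < start + 1000; i++) {
                runs[i] = random.nextInt(1_000_000);
            }
            Arrays.sort(runs, start, start + 1000);
        }
        newToOld(runs);
        long comparisonsOnRuns = comparisons;

        // Same number of docs, but with no pre-existing order.
        long[] unsorted = new long[4000];
        for (int i = 0; i < unsorted.length; i++) {
            unsorted[i] = random.nextInt(1_000_000);
        }
        newToOld(unsorted);
        System.out.println("sorted runs: " + comparisonsOnRuns
            + " comparisons, unsorted: " + comparisons + " comparisons");
    }
}
```

On input made of a bounded number of sorted runs, TimSort detects the runs and merges them, which is why the extra cost stays within a constant factor of a natural merge.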
I ran some indexing benchmarks to see how much slower indexing is with
SortingMergePolicy. To do this I quickly patched luceneutil to add a random
NumericDocValuesField to all documents and wrap the merge policy with
SortingMergePolicy. Indexing 10M docs from the wikimedium collection was 2x
slower with SortingMergePolicy (see ingestion rate logs attached). To measure
pure merge performance, I ran a forceMerge(1) on those indexes and
SortingMergePolicy made this forceMerge roughly 3.4x slower (856415 ms vs 250054 ms).
If you're curious, here is where the merging time is spent with
SortingMergePolicy according to my profiler:
- 32%: CompressingStoredFieldsReader.visitDocument (vs. < 1% when using a regular
merge policy)
- 17%: TimSort: to sort the doc mapping and postings lists
- 6%: Sorter.DocMap.oldToNew: used by SortingDocsEnum to map the old IDs to
the new ones
Most of the time is spent not on actual sorting but in visitDocument: because
the codec-specific merge routine can't be used, the stored fields format
decompresses every chunk multiple times (a few hundred times given that my
docs are really small; this would be less noticeable with larger docs).
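The repeated decompression can be pictured with a toy model in plain Java. This is not Lucene's stored fields code: ChunkVisitSketch, its methods and the chunk sizes are hypothetical, and the model simply assumes that a per-document visit decompresses the whole chunk containing the document unless the last decompressed chunk is reused. A bulk merge corresponds to reading documents in their natural order with reuse (one decompression per chunk), while per-document visits without reuse decompress each chunk once per document it contains.

```java
import java.util.Random;

public class ChunkVisitSketch {
    /**
     * Counts chunk decompressions for a sequence of per-document visits.
     * With cacheLastChunk == false every visit decompresses its chunk again
     * (an assumption of this toy model).
     */
    static int countDecompressions(int[] visitOrder, int docsPerChunk, boolean cacheLastChunk) {
        int decompressions = 0;
        int cachedChunk = -1;
        for (int doc : visitOrder) {
            int chunk = doc / docsPerChunk;
            if (!cacheLastChunk || chunk != cachedChunk) {
                decompressions++;
                cachedChunk = chunk;
            }
        }
        return decompressions;
    }

    /** Doc IDs 0..numDocs-1 in index order. */
    static int[] naturalOrder(int numDocs) {
        int[] order = new int[numDocs];
        for (int i = 0; i < numDocs; i++) {
            order[i] = i;
        }
        return order;
    }

    /** A pseudo-random permutation of the doc IDs (Fisher-Yates shuffle). */
    static int[] shuffledOrder(int numDocs, long seed) {
        int[] order = naturalOrder(numDocs);
        Random random = new Random(seed);
        for (int i = numDocs - 1; i > 0; i--) {
            int j = random.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        return order;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        int docsPerChunk = 250; // small docs: many docs share one chunk
        int[] natural = naturalOrder(numDocs);
        int[] sorted = shuffledOrder(numDocs, 42L);

        // One decompression per chunk, as in a bulk copy.
        System.out.println(countDecompressions(natural, docsPerChunk, true));  // 4
        // One decompression per visited doc: each chunk is decompressed 250 times.
        System.out.println(countDecompressions(natural, docsPerChunk, false)); // 1000
        // Reusing the last chunk helps little once docs are visited in sorted order.
        System.out.println(countDecompressions(sorted, docsPerChunk, true));
    }
}
```

In this model the number of decompressions per chunk equals the number of docs per chunk, so larger documents (fewer per chunk) would shrink the overhead.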
I think it's close, what do you think?
> Merge segments to sort them
> ---------------------------
>
> Key: LUCENE-4752
> URL: https://issues.apache.org/jira/browse/LUCENE-4752
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/index
> Reporter: David Smiley
> Assignee: Adrien Grand
> Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch,
> LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log,
> sorting_10M_ingestion.log
>
>
> It would be awesome if Lucene could write the documents out in a segment
> based on a configurable order. This of course applies to merging segments
> too. The benefit is increased locality on disk of documents that are likely to
> be accessed together. This often applies to documents near each other in
> time, but also spatially.