[jira] [Updated] (LUCENE-4752) Merge segments to sort them

Adrien Grand (JIRA) Mon, 18 Mar 2013 18:09:17 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Adrien Grand updated LUCENE-4752:
---------------------------------

    Attachment: sorting_10M_ingestion.log
                natural_10M_ingestion.log
                LUCENE-4752.patch

bq. Maybe just put a comment in IW where it calls merge.getReaders() why we 
don't access the readers list directly

Done.

bq. I started working on this (LUCENE-4830 for memory and LUCENE-4839 for 
complexity) and will run some indexing benchmarks with the Wikipedia corpus to 
see how it behaves compared to natural merging.

Now that SortingAtomicReader uses TimSort to compute the doc ID mapping and 
sort postings lists, using SortingMergePolicy only increases the merge 
complexity by constant factors compared to a natural merge if the readers to 
merge are sorted (I'm assuming the number of segments to merge is bounded). I 
think this makes online sorting a viable option.
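
To illustrate what "computing the doc ID mapping" means here, below is a minimal sketch in plain Java. It is not the code from the patch: the class and method names are hypothetical, the sort key is assumed to be one long value per document, and java.util.Arrays.sort (a stable sort adapted from TimSort for object arrays) stands in for Lucene's own sorter.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the SortingAtomicReader implementation.
class DocMapSketch {

    /**
     * Returns an old-to-new mapping: result[oldDocId] is the position of that
     * document once all documents are ordered by their sort value.
     */
    static int[] oldToNew(long[] sortValues) {
        Integer[] newToOld = new Integer[sortValues.length];
        for (int i = 0; i < newToOld.length; i++) {
            newToOld[i] = i;
        }
        // Stable sort of doc IDs by value: ties keep their original order,
        // and input that is already mostly sorted needs few comparisons.
        Arrays.sort(newToOld, (a, b) -> Long.compare(sortValues[a], sortValues[b]));

        // Invert newToOld to obtain oldToNew.
        int[] oldToNew = new int[sortValues.length];
        for (int newId = 0; newId < newToOld.length; newId++) {
            oldToNew[newToOld[newId]] = newId;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        long[] values = {42L, 7L, 19L, 7L};
        System.out.println(Arrays.toString(oldToNew(values))); // prints [3, 0, 2, 1]
    }
}
```

If the input is already sorted, the mapping is the identity, which is the cheap case referred to above.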

I ran some indexing benchmarks to see how much slower indexing is with 
SortingMergePolicy. To do this I quickly patched luceneutil to add a random 
NumericDocValuesField to all documents and wrap the merge policy with 
SortingMergePolicy. Indexing 10M docs from the wikimedium collection was 2x 
slower with SortingMergePolicy (see ingestion rate logs attached). To measure 
pure merge performance, I ran a forceMerge(1) on those indexes and 
SortingMergePolicy made this forceMerge 3.5x slower (856415 ms vs 250054 ms). 
If you're curious, here is where the merging time is spent with 
SortingMergePolicy according to my profiler:
 - 32%: CompressingStoredField.visitDocument (vs. < 1% when using a regular 
merge policy)
 - 17%: TimSort: to sort the doc mapping and postings lists
 - 6%: Sorter.DocMap.oldToNew: used by SortingDocsEnum to map the old IDs to 
the new ones

Most of the time is spent not on actual sorting but in visitDocument: because 
the codec-specific merge routine can't be used, the stored fields format 
decompresses every chunk multiple times (a few hundred times given that my 
docs are really small; this would be less noticeable with larger docs).
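
The effect can be pictured with a toy model in plain Java. This is a sketch under stated assumptions, not Lucene's stored fields reader: documents are assumed to be packed into fixed-size chunks, the hypothetical reader keeps only the most recently decompressed chunk, and all names are made up for illustration.

```java
// Toy model only; class and method names are hypothetical.
class ChunkVisitSketch {

    /** Counts chunk decompressions when docs are read in the given order. */
    static int countDecompressions(int[] readOrder, int docsPerChunk) {
        int decompressions = 0;
        int cachedChunk = -1; // assumption: only the last chunk stays decompressed
        for (int doc : readOrder) {
            int chunk = doc / docsPerChunk;
            if (chunk != cachedChunk) {
                decompressions++;
                cachedChunk = chunk;
            }
        }
        return decompressions;
    }

    /** Docs in index order: 0, 1, 2, ... */
    static int[] naturalOrder(int numDocs) {
        int[] order = new int[numDocs];
        for (int i = 0; i < numDocs; i++) {
            order[i] = i;
        }
        return order;
    }

    /** A scattered order standing in for a sort that is unrelated to index order. */
    static int[] stridedOrder(int numDocs, int stride) {
        int[] order = new int[numDocs];
        for (int i = 0; i < numDocs; i++) {
            order[i] = (int) (((long) i * stride) % numDocs);
        }
        return order;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        int docsPerChunk = 100;
        // Natural order touches each of the 10 chunks once.
        System.out.println(countDecompressions(naturalOrder(numDocs), docsPerChunk)); // prints 10
        // With stride 101, consecutive reads always land in different chunks.
        System.out.println(countDecompressions(stridedOrder(numDocs, 101), docsPerChunk)); // prints 1000
    }
}
```

In this model each chunk is decompressed once per document it contains when the read order is scattered, so the overhead grows with the number of documents per chunk, that is, with smaller documents.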

I think it's close, what do you think?
                
> Merge segments to sort them
> ---------------------------
>
>                 Key: LUCENE-4752
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4752
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: David Smiley
>            Assignee: Adrien Grand
>         Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
> LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, 
> sorting_10M_ingestion.log
>
>
> It would be awesome if Lucene could write the documents out in a segment 
> based on a configurable order.  This of course applies to merging segments 
> too. The benefit is increased locality on disk of documents that are likely to 
> be accessed together.  This often applies to documents near each other in 
> time, but also spatially.
