[
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-4752:
---------------------------------
Attachment: LUCENE-4752.patch
bq. I think these are not bad numbers.
Me neither! I'm rather happy with them actually.
bq. As for search, perhaps we can quickly hack up IndexSearcher to allow
terminating per-segment and then compare two Collectors TopFields and
TopSortedFields [...] but in order to do that, we must make sure that each
segment is sorted (i.e. those that are not hit by MP are still in random
order), or we somehow mark on each segment whether it's sorted or not
The attached patch takes a different approach. The idea is to combine
SortingMergePolicy with IndexWriterConfig.getMaxBufferedDocs: a flushed segment
never holds more than maxBufferedDocs documents, so any segment larger than that
must be the result of a merge and is therefore sorted. A new
EarlyTerminationIndexSearcher then overrides search so that segments in random
order are collected normally, while collection terminates early on segments
that are sorted.
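To make the idea concrete, here is a minimal, self-contained sketch of the
per-segment logic. It is not the code from the patch and uses no Lucene classes:
Segment, assumedSorted and topN are hypothetical names, a segment is modelled as
a plain array of sort values, and "sorted" is inferred purely from the size
threshold described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class EarlyTerminationSketch {

    /** Hypothetical stand-in for a segment: one sort value per document, in doc order. */
    static final class Segment {
        final int[] values;
        Segment(int... values) { this.values = values; }
    }

    /** Assumption: a segment larger than maxBufferedDocs came from a sorting merge. */
    static boolean assumedSorted(Segment segment, int maxBufferedDocs) {
        return segment.values.length > maxBufferedDocs;
    }

    /** Returns the n smallest sort values across all segments, in ascending order. */
    static List<Integer> topN(List<Segment> segments, int n, int maxBufferedDocs) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Max-heap holding the n smallest values seen so far.
        PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());
        for (Segment segment : segments) {
            boolean sorted = assumedSorted(segment, maxBufferedDocs);
            int collected = 0;
            for (int value : segment.values) {
                // In a sorted segment the first n docs are its best n: stop early.
                if (sorted && collected == n) {
                    break;
                }
                collected++;
                if (heap.size() < n) {
                    heap.add(value);
                } else if (value < heap.peek()) {
                    heap.poll();
                    heap.add(value);
                }
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
            new Segment(1, 5, 9, 12, 20, 30),   // 6 docs > 4: treated as sorted
            new Segment(7, 2, 40),              // 3 docs: random order, fully scanned
            new Segment(3, 4, 50, 60, 70));     // 5 docs > 4: treated as sorted
        System.out.println(topN(segments, 3, 4)); // prints [1, 2, 3]
    }
}
```

Small segments are always scanned in full, so the result stays correct even
though they are in random order; the saving comes only from the large, merged
segments.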
bq. Accessing "close" documents together ... we can make an artificial test
which accesses documents with sort-by-value in a specific range. But that's a
too artificial test, not sure what it will tell us.
Yes, I think the important thing to validate here is that merging does not get
exponentially slower as segments grow. Other checks are just bonus.
> Merge segments to sort them
> ---------------------------
>
> Key: LUCENE-4752
> URL: https://issues.apache.org/jira/browse/LUCENE-4752
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/index
> Reporter: David Smiley
> Assignee: Adrien Grand
> Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch,
> LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch,
> natural_10M_ingestion.log, sorting_10M_ingestion.log
>
>
> It would be awesome if Lucene could write the documents out in a segment
> based on a configurable order. This of course applies to merging segments
> too. The benefit is increased locality on disk of documents that are likely to
> be accessed together. This often applies to documents near each other in
> time, but also spatially.