[
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591788#comment-13591788
]
David Smiley commented on LUCENE-4752:
--------------------------------------
In response to Shai's comment:
https://issues.apache.org/jira/browse/LUCENE-3918?focusedCommentId=13591774&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13591774

I suppose there's little point in writing a sorted segment if it is going to
have to be merged anyway. Over a document's lifespan, it will live only
briefly in its initial segment, and that segment will be the smallest. And
sorting it before writing slows down flushing, which works against
real-time search requirements. So this issue should be about merging to
create sorted segments.

Shai pointed out that if we simply merge to create sorted segments, the result
doesn't improve very much, since the documents are effectively randomly striped
across the segments even if each segment is itself sorted.

So I think the aim should be to create segments that are not only internally
sorted but also sorted relative to one another, so that their value ranges
don't interleave. This is hard but I think it's doable.

So assume the segments are initially written unsorted. Eventually there are
too many of these segments and we need to start merging. If we have N such
segments, we want N/2 segments (or use the merge factor, but assume 1/2 for
the discussion) that are sorted both individually and across one another. I
think this could be done by looking at a merged DocValues view of the field
being sorted on (using RAM, of course; I suspect plenty of existing Lucene
code already does this for cache/search context), then dividing up the value
space, and then plucking out documents from these segments in order to
generate the first segment, then the second, etc. As for how to do this for
a particular range of DocValues that will become a segment, I think you first
generate a bitset of those docids. Then perhaps you use a
SlowCompositeReaderWrapper to see a merged view of the applicable segments,
using the bitset as a filter for the documents. I expect I'm overlooking
challenges, but I'm sure other smart people will point them out :-)
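
To make the plan above concrete, here is a rough sketch in plain Java. It
deliberately uses no Lucene classes: each source segment is modelled as a
long[] of sort values indexed by docid, and the names SortedMergeSketch,
DocRef, OutputPlan and plan() are all hypothetical. It also assumes the value
space is divided by equal document counts, which is only one possible choice,
and that documents with equal values may land on either side of a boundary.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: plan a merge of unsorted segments into globally sorted ones. */
final class SortedMergeSketch {

    /** One document in the merged view: source segment, local docid, sort value. */
    static final class DocRef {
        final int segment;
        final int docId;
        final long value;

        DocRef(int segment, int docId, long value) {
            this.segment = segment;
            this.docId = docId;
            this.value = value;
        }
    }

    /** Plan for one output segment. */
    static final class OutputPlan {
        /** selected[s] marks the docids of source segment s that belong here. */
        final BitSet[] selected;
        /** The same documents, in the order they would be written. */
        final List<DocRef> sortedDocs;

        OutputPlan(BitSet[] selected, List<DocRef> sortedDocs) {
            this.selected = selected;
            this.sortedDocs = sortedDocs;
        }
    }

    static List<OutputPlan> plan(long[][] segments, int outputCount) {
        // 1. Build the merged view of the sort values (all held in RAM).
        List<DocRef> merged = new ArrayList<>();
        for (int s = 0; s < segments.length; s++) {
            for (int d = 0; d < segments[s].length; d++) {
                merged.add(new DocRef(s, d, segments[s][d]));
            }
        }
        // List.sort is stable, so ties keep their (segment, docid) order.
        merged.sort(Comparator.comparingLong((DocRef ref) -> ref.value));

        // 2. Divide the sorted view into outputCount contiguous value ranges.
        List<OutputPlan> plans = new ArrayList<>();
        int total = merged.size();
        for (int i = 0; i < outputCount; i++) {
            int from = (int) ((long) total * i / outputCount);
            int to = (int) ((long) total * (i + 1) / outputCount);
            List<DocRef> slice = new ArrayList<>(merged.subList(from, to));

            // 3. One bitset per source segment, acting as the document filter.
            BitSet[] selected = new BitSet[segments.length];
            for (int s = 0; s < segments.length; s++) {
                selected[s] = new BitSet(segments[s].length);
            }
            for (DocRef ref : slice) {
                selected[ref.segment].set(ref.docId);
            }
            plans.add(new OutputPlan(selected, slice));
        }
        return plans;
    }

    public static void main(String[] args) {
        long[][] segments = {{5, 1, 9}, {3, 8, 2}, {7, 4}, {6, 0}};
        List<OutputPlan> plans = plan(segments, segments.length / 2);
        for (int i = 0; i < plans.size(); i++) {
            StringBuilder sb = new StringBuilder("output segment " + i + ":");
            for (DocRef ref : plans.get(i).sortedDocs) {
                sb.append(' ').append(ref.value);
            }
            System.out.println(sb);
        }
    }
}
```

In a real implementation the bitsets would presumably be applied as filters
over a merged reader rather than over arrays, and deleted documents, missing
values and memory limits would all need handling that this sketch ignores.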
> Sort documents when writing or merging segments
> -----------------------------------------------
>
> Key: LUCENE-4752
> URL: https://issues.apache.org/jira/browse/LUCENE-4752
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/index
> Reporter: David Smiley
> Assignee: Adrien Grand
>
> It would be awesome if Lucene could write the documents out in a segment
> based on a configurable order. This of course applies to merging segments
> too. The benefit is increased locality on disk of documents that are likely to
> be accessed together. This often applies to documents near each other in
> time, but also spatially.