[jira] [Commented] (LUCENE-4752) Sort documents when writing or merging segments

David Smiley (JIRA) Sun, 03 Mar 2013 09:23:13 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591788#comment-13591788
 ]


David Smiley commented on LUCENE-4752:
--------------------------------------

In response to Shai's comment 
https://issues.apache.org/jira/browse/LUCENE-3918?focusedCommentId=13591774&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13591774

I suppose there's little point in writing a sorted segment if it is going to 
have to be merged anyway.  Over a document's lifespan, it lives only briefly 
in its initial segment, and that segment will be the smallest.  And sorting 
before writing adds latency, which works against real-time search 
requirements.  So this issue should be about merging to create sorted segments.

Shai pointed out that if we simply merge to create sorted segments, locality 
doesn't improve very much, since the documents are effectively randomly 
striped across the segments even if each segment is itself sorted.
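
To illustrate Shai's point, here is a small simulation in plain Python (not 
Lucene code). It assumes each document has a single numeric sort key drawn at 
random; the names segment_ranges and overlapping_pairs are made up for this 
sketch. Every segment is sorted internally, yet each one still spans almost the 
whole key space, so a range query has to visit all of them.

```python
import random


def segment_ranges(segments):
    """Return the (min, max) sort key of each segment."""
    return [(min(seg), max(seg)) for seg in segments]


def overlapping_pairs(ranges):
    """Count pairs of segments whose key ranges overlap."""
    count = 0
    for i in range(len(ranges)):
        for j in range(i + 1, len(ranges)):
            lo_i, hi_i = ranges[i]
            lo_j, hi_j = ranges[j]
            if lo_i <= hi_j and lo_j <= hi_i:
                count += 1
    return count


rng = random.Random(0)
# Four segments of 1000 documents each; keys are random, then each
# segment is sorted on its own (internally sorted, but striped).
segments = [
    sorted(rng.randrange(1_000_000) for _ in range(1000))
    for _ in range(4)
]
ranges = segment_ranges(segments)
print(ranges)
print("overlapping pairs:", overlapping_pairs(ranges), "of 6")
```

With keys spread like this, all six pairs of segments overlap, which is the 
"striping" that internal sorting alone does not remove.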

So I think the aim should be to create segments that are not only internally 
sorted but also sorted relative to one another, so that each segment covers 
its own range of values.  This is hard but I think it's doable.

So assume the segments are initially written unsorted.  Eventually there are 
too many of these segments and we need to start merging.  If we have N such 
segments, we want N/2 segments (or whatever the merge factor gives, but assume 
1/2 for this discussion) that are sorted both individually and across one 
another.  I think this could be done by looking at a merged DocValues view of 
the sort field (held in RAM, of course; I suspect there is plenty of existing 
Lucene code that does this for caching and search), then dividing up the value 
space, and then plucking documents out of these segments in order to generate 
the first segment, then the second, etc.  As for how to do this for a 
particular range of DocValues that will become a segment, I think you first 
generate a bitset of the matching docids.  Then perhaps you use a 
SlowCompositeReaderWrapper to get a merged view of the applicable segments, 
using the bitset as a filter for the documents.  I expect I'm overlooking 
challenges but I'm sure other smart people will point them out :-)
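
A rough sketch of the steps above, as a plain-Python simulation rather than 
Lucene code. The assumptions: a segment is just a list of numeric sort keys 
(docid = list index), the value space is divided at quantiles of the merged 
keys so the output segments come out roughly equal in size, and all function 
names are hypothetical. It does not use DocValues or 
SlowCompositeReaderWrapper; the bitsets and the filtered merged view are 
imitated with lists.

```python
from bisect import bisect_right


def choose_boundaries(segments, num_out):
    """Divide the value space: pick num_out - 1 cut points at quantiles
    of the merged view of all sort keys."""
    all_keys = sorted(key for seg in segments for key in seg)
    n = len(all_keys)
    if n == 0:
        return []
    return [all_keys[i * n // num_out] for i in range(1, num_out)]


def build_bitsets(segments, cuts, out_idx):
    """One bitset per source segment, marking the docids whose key falls
    in the value range assigned to output segment out_idx."""
    return [
        [bisect_right(cuts, key) == out_idx for key in seg]
        for seg in segments
    ]


def write_sorted_segment(segments, bitsets):
    """Filter the merged view with the bitsets and emit the selected
    documents in key order as (key, source_segment, docid)."""
    selected = [
        (key, seg_idx, doc_id)
        for seg_idx, (seg, bits) in enumerate(zip(segments, bitsets))
        for doc_id, (key, keep) in enumerate(zip(seg, bits))
        if keep
    ]
    selected.sort()
    return selected


def merge_into_sorted_segments(segments, num_out):
    """Merge unsorted segments into num_out segments that are sorted
    individually and across one another."""
    cuts = choose_boundaries(segments, num_out)
    return [
        write_sorted_segment(segments, build_bitsets(segments, cuts, i))
        for i in range(num_out)
    ]


# Four unsorted segments merged into two (N -> N/2).
unsorted_segments = [[50, 10], [40, 80], [30, 70], [60, 20]]
for out in merge_into_sorted_segments(unsorted_segments, 2):
    print([key for key, _, _ in out])
```

Documents with equal keys always land in the same output segment here, so 
heavily duplicated keys can make the outputs uneven or even leave one empty; 
a real implementation would need to decide how to handle that.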
                
> Sort documents when writing or merging segments
> -----------------------------------------------
>
>                 Key: LUCENE-4752
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4752
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: David Smiley
>            Assignee: Adrien Grand
>
> It would be awesome if Lucene could write the documents out in a segment 
> based on a configurable order.  This of course applies to merging segments 
> too. The benefit is increased locality on disk of documents that are likely to 
> be accessed together.  This often applies to documents near each other in 
> time, but also spatially.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

