[ 
https://issues.apache.org/jira/browse/LUCENE-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591774#comment-13591774
 ] 

Shai Erera commented on LUCENE-3918:
------------------------------------

The approach taken by this issue (originally by the 3.x IndexSorter) is to 
sort an entire index offline. So if you have, e.g., a 10-segment index, you 
end up with a single segment that is totally sorted across all documents.

At least from my understanding of how online sorting would work 
(LUCENE-4752), the Codec would need to either determine the permutation on 
the documents beforehand, or build an in-memory segment and, once it's done, 
sort it and write it sorted, right? Otherwise, I don't understand how it can 
handle this series of addDocument calls (assume the value denotes the location 
of the document in the sorted index): doc(2), doc(1), doc(7), doc(0)...? The 
stored fields and term vectors are not cached in memory today. The location of 
a document in the sorted index is unknown until all keys (by which you sort) 
have been encountered, which may be too late for the Codec?
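
To make the point above concrete, here is a minimal sketch (plain Python, not 
Lucene code). It assumes the values in doc(2), doc(1), doc(7), doc(0) are sort 
keys and that all documents are buffered before anything is written; the 
function name sorted_permutation is hypothetical. It shows that the position 
of an already-added document can still change when a later key arrives.

```python
def sorted_permutation(keys):
    """Return old_to_new, where old_to_new[i] is the position that the
    i-th added document would take in the sorted segment."""
    # new_to_old[j] = arrival index of the document that lands at position j
    new_to_old = sorted(range(len(keys)), key=lambda i: keys[i])
    old_to_new = [0] * len(keys)
    for new_pos, old_pos in enumerate(new_to_old):
        old_to_new[old_pos] = new_pos
    return old_to_new


# After the first three addDocument calls: doc(2), doc(1), doc(7)
partial = sorted_permutation([2, 1, 7])

# After the fourth call, doc(0), every earlier document shifts by one
full = sorted_permutation([2, 1, 7, 0])
```

In this sketch the first document sits at position 1 after three additions and 
at position 2 after the fourth, so nothing could have been written to its 
final location before the last key was seen.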

And even if you get past that hurdle (say you're willing to cache everything 
in memory and then flush to disk sorted), how will you handle merges? Say you 
now have an index with segments 1, 2, 3 (each sorted). How do you merge-sort 
them? Today you don't have the API for it, so let's say that we add it 
(plugging in your own SegmentMerger). Now the MergePolicy (MP) selects 
segments 1 and 2 for merge, so you end up with segments 3 and 4, which are 
again each sorted separately, but the index is not globally sorted, right? In 
a sorted index, the segments need to have a consistent > (or <) relationship 
with one another; otherwise you're just traversing documents in random order.
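
The merge scenario above can be sketched as follows. This is an illustration 
in plain Python, not the Lucene merge API: segments are modeled as lists of 
sort keys, and the helper names merge_sorted_segments and globally_sorted are 
hypothetical.

```python
import heapq


def merge_sorted_segments(*segments):
    """Merge-sort several individually sorted segments into one."""
    return list(heapq.merge(*segments))


def globally_sorted(segments):
    """True if reading the segments one after another yields sorted keys."""
    flat = [key for seg in segments for key in seg]
    return all(a <= b for a, b in zip(flat, flat[1:]))


seg1 = [1, 5, 9]
seg2 = [2, 6, 8]
seg3 = [3, 4, 7]

# The merge policy picks only segments 1 and 2, producing segment 4.
seg4 = merge_sorted_segments(seg1, seg2)

partial_merge_ok = globally_sorted([seg3, seg4])
full_merge_ok = globally_sorted([merge_sorted_segments(seg1, seg2, seg3)])
```

Under these assumptions, segment 4 is sorted on its own, but its key range 
overlaps that of segment 3, so the index as a whole is sorted only when all 
segments are merged into one.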

In short, if you do come up with a reasonable way to do online index sorting 
(on LUCENE-4752), I'll be all for it. And if it makes sense, we can even drop 
the offline index sorter too. But I think there are many challenges in getting 
it right, and efficient. It's not a mere Codec trick IMO.

Also, note that as far as memory consumption for offline sorting goes, we 
only cache in memory the posting list that is currently being sorted (the 
rest relies on pre-existing random-access APIs).
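
As a rough illustration of why one posting list at a time is enough, the 
sketch below remaps a single term's postings through a document permutation 
and re-sorts them. It is plain Python under stated assumptions, not the code 
in the attached patch: postings are modeled as (doc ID, frequency) pairs, and 
remap_postings and old_to_new are hypothetical names.

```python
def remap_postings(postings, old_to_new):
    """Rewrite one term's posting list in terms of the new doc IDs.

    postings:   list of (old_doc_id, freq) pairs in increasing doc-ID order
    old_to_new: old_to_new[old_doc_id] is the doc ID in the sorted index
    """
    remapped = [(old_to_new[doc], freq) for doc, freq in postings]
    # Only this one list is held in memory while it is being sorted.
    remapped.sort()
    return remapped


old_to_new = [2, 1, 3, 0]
postings = [(0, 3), (2, 1), (3, 5)]
new_postings = remap_postings(postings, old_to_new)
```

Each term can be processed independently this way, so peak memory in the 
sketch is bounded by the longest posting list plus the permutation itself.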

But I could be totally missing your idea for online sorting, in which case 
I'd appreciate it if you could elaborate on how you think it can be done. I'd 
prefer that we discuss that on LUCENE-4752, though.
                
> Port index sorter to trunk APIs
> -------------------------------
>
>                 Key: LUCENE-3918
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3918
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: modules/other
>    Affects Versions: 4.0-ALPHA
>            Reporter: Robert Muir
>             Fix For: 4.2, 5.0
>
>         Attachments: LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, 
> LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch
>
>
> LUCENE-2482 added an IndexSorter to 3.x, but we need to port this
> functionality to the 4.0 APIs.

