[
https://issues.apache.org/jira/browse/LUCENE-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591774#comment-13591774
]
Shai Erera commented on LUCENE-3918:
------------------------------------
The approach taken by this issue (well, originally by the 3x IndexSorter) is
to sort an entire index offline. So if you have, e.g., a 10-segment index, you
end up with a single segment, totally sorted across all documents.
At least from my understanding of how the online sorting would work
(LUCENE-4752), the Codec would need to either determine the permutation on
the documents beforehand, or build an in-memory segment and then, when it's
done, sort it and write it sorted, right? Otherwise, I don't understand how it
can handle this series of addDocument calls (assume the value denotes the
location of the document in the sorted index): doc(2), doc(1), doc(7),
doc(0)...? The stored fields and term vectors are not cached in memory today.
The location of a document in the sorted index is unknown until all keys (by
which you sort) are encountered, which may be too late for the Codec?
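To make the doc(2), doc(1), doc(7), doc(0) example concrete, here is a minimal
sketch in plain Java. It does not use any Lucene API; the class and method
names (SortPermutation, newToOld, oldToNew) are made up for illustration. It
only shows that the permutation can be computed once every sort key has been
seen, which is why the first document's final position is unknown at the time
it is added.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; this is not Lucene code.
final class SortPermutation {

    // newToOld[i] = arrival-order doc ID of the document at sorted position i.
    // All keys must be available before this can be computed.
    static int[] newToOld(long[] keys) {
        Integer[] ids = new Integer[keys.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Stable sort of doc IDs by their sort key.
        Arrays.sort(ids, Comparator.comparingLong(id -> keys[id]));
        int[] result = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            result[i] = ids[i];
        }
        return result;
    }

    // oldToNew[d] = sorted position of the document that arrived as doc ID d.
    static int[] oldToNew(int[] newToOld) {
        int[] result = new int[newToOld.length];
        for (int i = 0; i < newToOld.length; i++) {
            result[newToOld[i]] = i;
        }
        return result;
    }

    public static void main(String[] args) {
        // Keys in arrival order: doc(2), doc(1), doc(7), doc(0).
        long[] keys = {2, 1, 7, 0};
        int[] n2o = newToOld(keys);
        int[] o2n = oldToNew(n2o);
        System.out.println(Arrays.toString(n2o)); // [3, 1, 0, 2]
        System.out.println(Arrays.toString(o2n)); // [2, 1, 3, 0]
    }
}
```

The first document added (key 2) ends up at position 2, but that is only
known after the later keys 1 and 0 have arrived.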
And even if you get past that hurdle (say you're willing to cache everything
in memory and then flush to disk sorted), how will you handle merges? Say you
have an index with segments 1,2,3 (each sorted). How do you merge-sort them?
Today, you don't have the API for it, so let's say that we add it (plugging in
your own SegmentMerger). Now MP (the MergePolicy) selects segments 1,2 for
merge, so you end up with segments 3,4, which are again each sorted
separately, but the index is not globally sorted, right? In a sorted index,
the segments need to have a consistent > (or <) relationship between them, or
otherwise you're just traversing documents in random order.
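The merge scenario can be sketched the same way. This is plain Java with
lists of sort keys standing in for segments; the names (SegmentMergeSketch,
mergeSorted, isSorted) and the key values are assumptions made up for
illustration, not Lucene's SegmentMerger or MergePolicy. It merges segments 1
and 2 into a sorted segment 4 and then checks the order seen when walking
segment 3 followed by segment 4.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; lists of keys stand in for segments.
final class SegmentMergeSketch {

    // Classic two-way merge of two individually sorted key lists.
    static List<Integer> mergeSorted(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i) <= b.get(j)) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    // True if the keys are in non-decreasing order.
    static boolean isSorted(List<Integer> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i - 1) > keys.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> seg1 = Arrays.asList(1, 4, 9);
        List<Integer> seg2 = Arrays.asList(2, 3, 8);
        List<Integer> seg3 = Arrays.asList(0, 5, 7);

        // The merge policy picks segments 1 and 2; the result is segment 4.
        List<Integer> seg4 = mergeSorted(seg1, seg2);

        // Walking the index means visiting segment 3, then segment 4.
        List<Integer> index = new ArrayList<>(seg3);
        index.addAll(seg4);

        System.out.println(isSorted(seg4));  // true
        System.out.println(isSorted(index)); // false
    }
}
```

Each segment is sorted on its own, but because the key ranges of segments 3
and 4 overlap, the traversal across segments is not in sorted order.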
In short, if you do come up with a reasonable way to do online index sorting
(on LUCENE-4752), I'll be all for it. And if it will make sense, we can even
drop the offline index sorter too. But I think that there are many challenges
in getting it right, and efficiently. It's not a mere Codec trick IMO.
Also, note that as far as memory consumption for offline sorting goes, we only
cache in memory the posting list that is currently being sorted (the rest
relies on the pre-existing random-access API).
But I could be totally missing your idea for online sorting, in which case I'd
appreciate it if you could elaborate on how you think it can be done. I'd
prefer that we discuss that on LUCENE-4752, though.
> Port index sorter to trunk APIs
> -------------------------------
>
> Key: LUCENE-3918
> URL: https://issues.apache.org/jira/browse/LUCENE-3918
> Project: Lucene - Core
> Issue Type: Task
> Components: modules/other
> Affects Versions: 4.0-ALPHA
> Reporter: Robert Muir
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch,
> LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch
>
>
> LUCENE-2482 added an IndexSorter to 3.x, but we need to port this
> functionality to 4.0 APIs.