[jira] [Commented] (LUCENE-7255) Paging with SortingMergePolicy and EarlyTerminatingSortingCollector

Robert Muir (JIRA) Thu, 28 Apr 2016 08:26:03 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262327#comment-15262327
 ]


Robert Muir commented on LUCENE-7255:
-------------------------------------

{quote}
Ok, I see. It would be great for deep pagination to take advantage of index 
sorting without requiring the number of documents to skip. 
[LUCENE-6766|https://issues.apache.org/jira/browse/LUCENE-6766] is really 
promising.
{quote}

I agree. I think it's crucial not to require the user to do a bunch of tracking. 
Otherwise it defeats the point of the searchAfter method, which is to make it 
easy for things to be more efficient if you want to page.

{quote}
We could make pagination work better in the case of sorted segments by tracking 
the last competitive document per segment rather than at the index level. This 
way, on each sorted segment, we could directly jump to the next competitive 
document, so the collector would actually only collect numWanted documents 
rather than numToSkip+numWanted. This would require a custom collector however.
{quote}

We should not let "would require a custom collector" prevent us from exploring 
this. As I see it, a custom collector is already required today, and to boot, 
paging does not work with it :)

Of course, it is important that long-term this stuff can work with searchAfter 
automatically. But I don't see any proposals for how this can work now, and I'm 
pretty sure that to support it we need to "track more stuff" on behalf of the 
user.

There are a lot of ways that could work for this collector: a number like 
Christine's `numToSkip` combined with `topValue`, or Adrien's per-segment set 
of docIDs, or a per-segment set of docIDs combined with `topValue` (to keep 
priority queues constant size on the unsorted segments), and so on.

But I don't think we should worry about this right now. I think the searchAfter 
API will need some change regardless, if we want it to work transparently, and 
that should be thought out carefully.
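
To make the per-segment tracking idea concrete, here is a minimal sketch in plain Java. It does not use any Lucene API: each "segment" is just an int array already sorted by the sort key, and the class name `PerSegmentPaging`, the method `nextPage`, and the `resume` array are hypothetical names made up for illustration. It only shows why remembering a resume position per sorted segment lets each page look at no more than `pageSize` entries per segment instead of numToSkip+numWanted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustration only: not a Lucene collector. Segments are modelled as
// int arrays that are already sorted ascending by the sort key.
final class PerSegmentPaging {

    /**
     * Returns the next page of values in global sort order.
     * resume[seg] is the index of the next unreturned entry in segment seg;
     * it is advanced in place so the caller can pass it to the next call.
     */
    static List<Integer> nextPage(int[][] segments, int[] resume, int pageSize) {
        // Queue entries are {value, segment, index}; ties are broken by
        // segment and then index so entries of one segment stay in order.
        PriorityQueue<int[]> queue = new PriorityQueue<>((a, b) -> {
            if (a[0] != b[0]) return Integer.compare(a[0], b[0]);
            if (a[1] != b[1]) return Integer.compare(a[1], b[1]);
            return Integer.compare(a[2], b[2]);
        });

        for (int seg = 0; seg < segments.length; seg++) {
            // Jump straight to the resume point and stop after pageSize
            // entries: on a sorted segment nothing later can be competitive.
            int end = Math.min(segments[seg].length, resume[seg] + pageSize);
            for (int i = resume[seg]; i < end; i++) {
                queue.add(new int[] {segments[seg][i], seg, i});
            }
        }

        List<Integer> page = new ArrayList<>();
        while (page.size() < pageSize && !queue.isEmpty()) {
            int[] hit = queue.poll();
            page.add(hit[0]);
            // Only entries actually returned move the resume point forward.
            resume[hit[1]] = hit[2] + 1;
        }
        return page;
    }

    public static void main(String[] args) {
        int[][] segments = { {1, 4, 7, 10}, {2, 3, 8}, {5, 6, 9} };
        int[] resume = new int[segments.length];
        List<Integer> page;
        while (!(page = nextPage(segments, resume, 4)).isEmpty()) {
            System.out.println(page);
        }
    }
}
```

The state the caller has to carry between pages here is one position per segment, which is exactly the "track more stuff" cost mentioned above. Unsorted segments are not handled by this sketch.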

> Paging with SortingMergePolicy and EarlyTerminatingSortingCollector
> -------------------------------------------------------------------
>
>                 Key: LUCENE-7255
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7255
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 5.3, 5.4, 5.5, 6.0
>            Reporter: Andrés de la Peña
>              Labels: EarlyTerminatingSortingCollector, pagination, paging, 
> searchafter, sortingmergepolicy
>         Attachments: LUCENE-7255_v0.diff
>
>
> {{EarlyTerminatingSortingCollector}} does not seem to work when used with a 
> {{TopDocsCollector}} searching for documents after a certain {{FieldDoc}}. 
> That is, it can't be used for paging. The following code reproduces the 
> problem:
> {code}
> // Sort to be used both with merge policy and queries
> Sort sort = new Sort(new SortedNumericSortField(FIELD_NAME, SortField.Type.INT));
> // Create directory
> RAMDirectory directory = new RAMDirectory();
> // Setup merge policy
> TieredMergePolicy tieredMergePolicy = new TieredMergePolicy();
> SortingMergePolicy sortingMergePolicy = new SortingMergePolicy(tieredMergePolicy, sort);
> // Setup index writer
> IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new SimpleAnalyzer());
> indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
> indexWriterConfig.setMergePolicy(sortingMergePolicy);
> IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
> // Index values
> for (int i = 1; i <= 1000; i++) {
>     Document document = new Document();
>     document.add(new NumericDocValuesField(FIELD_NAME, i));
>     indexWriter.addDocument(document);
> }
> // Force index merge to ensure early termination
> indexWriter.forceMerge(1, true);
> indexWriter.commit();
> // Create index searcher
> IndexReader reader = DirectoryReader.open(directory);
> IndexSearcher searcher = new IndexSearcher(reader);
> // Paginated read
> int pageSize = 10;
> FieldDoc pageStart = null;
> while (true) {
>     logger.info("Collecting page starting at: {}", pageStart);
>     Query query = new MatchAllDocsQuery();
>     TopDocsCollector tfc = TopFieldCollector.create(sort, pageSize, pageStart, true, false, false);
>     EarlyTerminatingSortingCollector collector = new EarlyTerminatingSortingCollector(tfc, sort, pageSize, sort);
>     searcher.search(query, collector);
>     ScoreDoc[] scoreDocs = tfc.topDocs().scoreDocs;
>     for (ScoreDoc scoreDoc : scoreDocs) {
>         pageStart = (FieldDoc) scoreDoc;
>         logger.info("FOUND {}", scoreDoc);
>     }
>     logger.info("Terminated early: {}", collector.terminatedEarly());
>     if (scoreDocs.length < pageSize) break;
> }
> // Close
> reader.close();
> indexWriter.close();
> directory.close();
> {code}
> The query for the second page doesn't return any results. However, we get 
> the expected results if we don't wrap the {{TopFieldCollector}} in the 
> {{EarlyTerminatingSortingCollector}}.


