[
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698207#comment-16698207
]
Toke Eskildsen commented on SOLR-13013:
---------------------------------------
[~janhoy] Nice idea! The values are collected as Objects, so it would involve
some instanceof-checks to estimate memory overhead. I am unsure how that would
affect performance, but it would be great if we could both avoid the risk of
OOM and make it simpler for the user.
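
As a rough illustration of what such instanceof-checks could look like, here is a minimal sketch. The class name, the method name and all byte constants are assumptions made for this example, not taken from the patch, and real JVM object layouts vary.

```java
import java.util.List;

/** Hypothetical helper: coarse heap estimate for values collected as Objects. */
class ValueSizeEstimator {
    // Assumed per-object overhead in bytes; actual JVM layouts differ.
    static final long OBJECT_HEADER = 16;

    /** Returns a coarse estimate, in bytes, of the heap used by one buffered value. */
    static long estimateBytes(Object value) {
        if (value == null) {
            return 0;
        }
        if (value instanceof Long || value instanceof Double) {
            return OBJECT_HEADER + 8;
        }
        if (value instanceof Integer || value instanceof Float) {
            return OBJECT_HEADER + 4;
        }
        if (value instanceof CharSequence) {
            // Wrapper object plus backing array, assuming 2 bytes per char.
            return OBJECT_HEADER * 2 + 2L * ((CharSequence) value).length();
        }
        if (value instanceof List) {
            // Multi-valued field: list overhead plus one reference per element.
            long sum = OBJECT_HEADER;
            for (Object element : (List<?>) value) {
                sum += 8 + estimateBytes(element);
            }
            return sum;
        }
        // Unknown type: fall back to a fixed pessimistic guess.
        return OBJECT_HEADER * 4;
    }
}
```

The cost is a handful of type checks per value, which would have to be measured against the DV-retrieval cost before drawing conclusions about performance.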
Technically a fallback would be easy to do. It would even be possible to do a
partial fallback: If the memory limit for the buffered values is reached before
all 30K SortDocs have been processed, switch back to standard "sort order
DV-resolving with immediate delivery", but use the already collected values
whenever possible.
Adjusting the window size for subsequent windows is tricky, as it requires
weighing query + sort cost vs. DV-retrieval cost. It would be possible to
collect a run time profile of the two parts and use that for qualified
guessing, but then it's sounding like quite a large project.
If we determine that the base idea of this JIRA-issue has merit, the first
version could just use a simple fallback to sort order DV-resolving, without
any re-use of already collected values, and stay in that mode for the rest of
the current export. Re-use and/or window shrinking could be later enhancements.
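
The simple first-version fallback could be sketched roughly as below. BufferedWindow, its Mode enum and the byte budget are hypothetical names for illustration only; the caller is assumed to supply a size estimate for each value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: buffers values until a byte budget is hit, then stays in fallback mode. */
class BufferedWindow {
    enum Mode { DOC_ID_ORDER, SORT_ORDER_FALLBACK }

    private final long maxBytes;
    private long usedBytes = 0;
    private Mode mode = Mode.DOC_ID_ORDER;
    private final List<Object> buffer = new ArrayList<>();

    BufferedWindow(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Tries to buffer a value; returns false once the export has fallen back. */
    boolean offer(Object value, long estimatedBytes) {
        if (mode == Mode.SORT_ORDER_FALLBACK) {
            return false;
        }
        if (usedBytes + estimatedBytes > maxBytes) {
            // Simple variant: drop what was collected, no re-use of values.
            buffer.clear();
            usedBytes = 0;
            mode = Mode.SORT_ORDER_FALLBACK;
            return false;
        }
        buffer.add(value);
        usedBytes += estimatedBytes;
        return true;
    }

    /** Called at the start of each window; never leaves fallback mode. */
    void startWindow() {
        buffer.clear();
        usedBytes = 0;
    }

    Mode mode() {
        return mode;
    }

    int buffered() {
        return buffer.size();
    }
}
```

Once offer returns false, the caller would resolve values in sort order with immediate delivery for the rest of the export. Partial re-use would mean keeping the buffer instead of clearing it.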
> Change export to extract DocValues in docID order
> -------------------------------------------------
>
> Key: SOLR-13013
> URL: https://issues.apache.org/jira/browse/SOLR-13013
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Export Writer
> Affects Versions: 7.5, master (8.0)
> Reporter: Toke Eskildsen
> Priority: Major
> Fix For: master (8.0)
>
> Attachments: SOLR-13013_proof_of_concept.patch
>
>
> The streaming export writer uses a sliding window of 30,000 documents for
> paging through the result set in a given sort order. Each time a window has
> been calculated, the values for the export fields are retrieved from the
> underlying DocValues structures in document sort order and delivered.
> The iterative DocValues API introduced in Lucene/Solr 7 does not support
> random access. The current export implementation bypasses this by creating a
> new DocValues-iterator for each individual value to retrieve. This slows down
> export as the iterator has to seek to the given docID from start for each
> value. The slowdown scales with shard size (see LUCENE-8374 for details). An
> alternative is to extract the DocValues in docID-order, with re-use of
> DocValues-iterators. The idea is as follows:
> # Change the FieldWriters for export to re-use the DocValues-iterators if
> subsequent requests are for docIDs higher than the previous ones
> # Calculate the sliding window of SortDocs as usual
> # Take a note of the order of the SortDocs in the sliding window
> # Re-sort the SortDocs in docID-order
> # Extract the DocValues to a temporary on-heap structure
> # Re-sort the extracted values to the original sliding window order
> # Deliver the values
> One big difference from the current export code is of course the need to hold
> the whole sliding window scaled result set in memory. This might well be a
> showstopper as there is no real limit to how large this partial result set
> can be. Maybe such an optimization could be requested explicitly if the user
> knows that there is enough memory?
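
The steps in the description above can be sketched in a few lines. This is an illustration under stated assumptions, not the code from the attached patch: DocIdOrderExtractor is a made-up name, and the IntFunction stands in for a forward-only DocValues iterator that is only ever asked for ascending docIDs.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.IntFunction;

/** Hypothetical sketch of extracting values in docID order for one sliding window. */
class DocIdOrderExtractor {
    /**
     * @param windowDocIds  docIDs of the window, in the requested sort order
     * @param forwardReader stand-in for a re-used, forward-only DocValues iterator
     * @return the values, in the original window (sort) order
     */
    static Object[] extract(int[] windowDocIds, IntFunction<Object> forwardReader) {
        // Note the original order: positions[i] is an index into the window.
        Integer[] positions = new Integer[windowDocIds.length];
        for (int i = 0; i < positions.length; i++) {
            positions[i] = i;
        }
        // Re-sort the window positions by ascending docID.
        Arrays.sort(positions, Comparator.comparingInt(p -> windowDocIds[p]));

        // Extract to a temporary on-heap structure. Writing each value at its
        // original position restores the window order without a second sort.
        Object[] values = new Object[windowDocIds.length];
        for (int position : positions) {
            values[position] = forwardReader.apply(windowDocIds[position]);
        }
        return values;
    }
}
```

For a window with docIDs 42, 7, 19 the reader is called with 7, 19, 42, while the returned array still holds the values for 42, 7, 19 in that order. The values array is the on-heap cost discussed in the description.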