[
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701898#comment-16701898
]
Joel Bernstein commented on SOLR-13013:
---------------------------------------
Interesting findings. I can work on getting this patch committed, possibly for
the 8.0 release.
A couple of thoughts about the design of the /export handler.
The /export handler was very much designed to support MapReduce operations
(distributed grouping, rollups, relational algebra) in Streaming Expressions.
Scaling these MapReduce operations took the following path:
1) Sharding: The /export handler benefits tremendously from sharding. The
benefits go well beyond linear. This is because two shards both double the
computing power and more than halve the amount of work that needs to be done
by each shard.
2) Hash partitioning and worker collections: Sharding very quickly causes
bottlenecks on a single aggregator node. Streaming Expressions' parallel
function, combined with the hash partitioner, allows the /export streams to be
partitioned into X slices and brings into play not just the shards but also
the replicas. When a reduce operation on the worker nodes (rollup, innerJoin)
limits the number of records emitted in the final stream, this is an extremely
powerful scaling tool (a sketch follows this list).
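For reference, here is a minimal sketch of what such a hash-partitioned parallel
rollup over the /export handler could look like when submitted through SolrJ. The
collection names, field names, zkHost and Solr URL below are placeholders for
illustration, not taken from this issue:
{code:java}
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ParallelExportExample {
  public static void main(String[] args) throws Exception {
    // parallel() spreads the expression over 4 workers; partitionKeys on the
    // /export search applies the hash partitioner so each worker reads one slice.
    String expr =
        "parallel(workerCollection,"
      + "  rollup("
      + "    search(mainCollection, q=\"*:*\", fl=\"a_s,b_i\", sort=\"a_s asc\","
      + "           qt=\"/export\", partitionKeys=\"a_s\"),"
      + "    over=\"a_s\", sum(b_i)),"
      + "  workers=\"4\", zkHost=\"localhost:9983\", sort=\"a_s asc\")";

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("expr", expr);
    params.set("qt", "/stream");

    SolrStream stream = new SolrStream("http://localhost:8983/solr/workerCollection", params);
    stream.open();
    try {
      for (Tuple tuple = stream.read(); !tuple.EOF; tuple = stream.read()) {
        System.out.println(tuple.getString("a_s") + " -> " + tuple.get("sum(b_i)"));
      }
    } finally {
      stream.close();
    }
  }
}
{code}
Because the rollup happens on the workers, only the reduced tuples flow back to
the aggregator, which is what makes this pattern scale.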
So, from a pure /export standpoint with no reduce operation, all from a single
shard, you are working somewhat against the design goals of the system. That
being said, the faster we make a pure export from a single shard, the more use
cases the /export handler serves.
> Change export to extract DocValues in docID order
> -------------------------------------------------
>
> Key: SOLR-13013
> URL: https://issues.apache.org/jira/browse/SOLR-13013
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Export Writer
> Affects Versions: 7.5, master (8.0)
> Reporter: Toke Eskildsen
> Priority: Major
> Fix For: master (8.0)
>
> Attachments: SOLR-13013_proof_of_concept.patch,
> SOLR-13013_proof_of_concept.patch
>
>
> The streaming export writer uses a sliding window of 30,000 documents for
> paging through the result set in a given sort order. Each time a window has
> been calculated, the values for the export fields are retrieved from the
> underlying DocValues structures in document sort order and delivered.
> The iterative DocValues API introduced in Lucene/Solr 7 does not support
> random access. The current export implementation bypasses this by creating a
> new DocValues-iterator for each individual value to retrieve. This slows down
> export, as the iterator has to seek from the start to the given docID for each
> value. The slowdown scales with shard size (see LUCENE-8374 for details). An
> alternative is to extract the DocValues in docID-order, with re-use of
> DocValues-iterators. The idea is as follows:
> # Change the FieldWriters for export to re-use the DocValues-iterators if
> subsequent requests are for docIDs higher than the previous ones
> # Calculate the sliding window of SortDocs as usual
> # Take a note of the order of the SortDocs in the sliding window
> # Re-sort the SortDocs in docID-order
> # Extract the DocValues to a temporary on-heap structure
> # Re-sort the extracted values to the original sliding window order
> # Deliver the values
> One big difference from the current export code is of course the need to hold
> the extracted values for the whole sliding window in memory. This might well be
> a showstopper, as there is no real limit to how large this partial result set
> can be. Maybe such an optimization could be requested explicitly if the user
> knows that there is enough memory?
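As a rough illustration of the docID-ordered extraction described in the steps
above (a simplified sketch, not the attached patch: it assumes a single numeric
field on one segment and represents the sliding window as a plain int[] of docIDs):
{code:java}
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

class WindowValueExtractor {
  private final LeafReader reader;
  private final String field;
  private NumericDocValues values; // re-used as long as docIDs keep increasing
  private int lastDocID = -1;

  WindowValueExtractor(LeafReader reader, String field) {
    this.reader = reader;
    this.field = field;
  }

  /** docIDs arrive in sliding-window sort order; values come back in that same order. */
  long[] extract(int[] sortOrderDocIDs) throws IOException {
    // Take note of the original (sort) order, then visit docs in ascending docID order.
    Integer[] order = new Integer[sortOrderDocIDs.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, (a, b) -> Integer.compare(sortOrderDocIDs[a], sortOrderDocIDs[b]));

    long[] result = new long[sortOrderDocIDs.length];
    for (int idx : order) {
      int docID = sortOrderDocIDs[idx];
      if (values == null || docID <= lastDocID) {
        // The iterator is forward-only: only re-create it when we would go backwards.
        values = DocValues.getNumeric(reader, field);
      }
      lastDocID = docID;
      // Extract into a temporary on-heap structure, written at the original position
      // so the caller gets the values back in sliding-window order.
      result[idx] = values.advanceExact(docID) ? values.longValue() : 0L;
    }
    return result;
  }
}
{code}
Recreating the iterator only when a docID is not higher than the previous one is
what lets a mostly-forward window avoid the per-value seek-from-start cost.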
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]