[
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701946#comment-16701946
]
Toke Eskildsen commented on SOLR-13013:
---------------------------------------
[~joel.bernstein] Unfortunately I don't have proper hardware at hand to test
with our large shards in a multi-shard setup. I _could_ put them on a spinning
drive, now that I think about it, but I am also afraid that my test box does
not have enough memory to fully cache the DocValues structures when using
multiple shards, which would complicate testing somewhat. I'll see what else
we have lying around; if nothing else, I could delete 3/4 of the data in 4 of
the shards and run with those instead (though that takes some days).
Up until now we have used export exclusively for simple query-based data
dumps, so that was my go-to case. It is probably down to my limited
understanding of Streaming Expressions that I do not see the methodological
problem in my test:
I get that multi-sharding, replicas and hashing (I'm a bit unsure about the
hashing part) can distribute and parallelize the load to make processing
faster, but of those 3, only "more and smaller shards" would reduce the total
amount of work, as I understand it? So with regard to that, any optimization
to the export should help a single-shard simple export and a more complex
distributed setup equally, measured as total work to be done?
I am on even shakier ground with the local reduce operation. Isn't that step
after the export part, and therefore extremely dependent on raw export speed?
Or is there some shortcut mechanism I haven't understood?
> Change export to extract DocValues in docID order
> -------------------------------------------------
>
> Key: SOLR-13013
> URL: https://issues.apache.org/jira/browse/SOLR-13013
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: Export Writer
> Affects Versions: 7.5, master (8.0)
> Reporter: Toke Eskildsen
> Priority: Major
> Fix For: master (8.0)
>
> Attachments: SOLR-13013_proof_of_concept.patch,
> SOLR-13013_proof_of_concept.patch
>
>
> The streaming export writer uses a sliding window of 30,000 documents for
> paging through the result set in a given sort order. Each time a window has
> been calculated, the values for the export fields are retrieved from the
> underlying DocValues structures in document sort order and delivered.
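> Schematically, the current flow looks roughly like this (hypothetical names,
> not the actual ExportWriter code):
> {code:java}
> // Page through the sorted result set one 30,000-document window at a time.
> while (moreDocs()) {
>   SortDoc[] window = nextSortedWindow(30_000); // window is in result-sort order
>   for (SortDoc doc : window) {
>     for (FieldWriter fw : fieldWriters) {      // one DocValues lookup per field
>       fw.write(doc, out);
>     }
>   }
> }
> {code}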
> The iterative DocValues API introduced in Lucene/Solr 7 does not support
> random access. The current export implementation works around this by
> creating a new DocValues-iterator for each individual value it retrieves.
> This slows down export, as each new iterator has to seek from the start to
> the given docID. The slowdown scales with shard size (see LUCENE-8374 for
> details). An alternative is to extract the DocValues in docID-order, re-using
> the DocValues-iterators. The idea is as follows (sketched in code after the
> list):
> # Change the FieldWriters for export to re-use the DocValues-iterators if
> subsequent requests are for docIDs higher than the previous ones
> # Calculate the sliding window of SortDocs as usual
> # Take a note of the order of the SortDocs in the sliding window
> # Re-sort the SortDocs in docID-order
> # Extract the DocValues to a temporary on-heap structure
> # Re-sort the extracted values to the original sliding window order
> # Deliver the values
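> Step 1 could be a wrapper that keeps its iterator as long as docIDs ascend
> and only recreates it on a regression. A minimal sketch against the Lucene 7+
> API (class and method names are mine, not the attached patch's):
> {code:java}
> import java.io.IOException;
> import org.apache.lucene.index.DocValues;
> import org.apache.lucene.index.LeafReader;
> import org.apache.lucene.index.NumericDocValues;
>
> class ReusingNumericReader {
>   private final LeafReader reader;
>   private final String field;
>   private NumericDocValues values;
>
>   ReusingNumericReader(LeafReader reader, String field) throws IOException {
>     this.reader = reader;
>     this.field = field;
>     this.values = DocValues.getNumeric(reader, field);
>   }
>
>   // Cheap while callers ask for ascending docIDs, hence the docID re-sort.
>   long get(int docID) throws IOException {
>     if (docID < values.docID()) {
>       // The iterator cannot rewind, so a backwards request forces a restart.
>       values = DocValues.getNumeric(reader, field);
>     }
>     return values.advanceExact(docID) ? values.longValue() : 0L; // 0L = missing
>   }
> }
> {code}
> Steps 2-7 in outline (extractValues() and deliver() are hypothetical helpers):
> {code:java}
> SortDoc[] window = nextWindow();                    // 2: sliding window as today
> Integer[] order = new Integer[window.length];       // 3: remember window order
> for (int i = 0; i < order.length; i++) order[i] = i;
> java.util.Arrays.sort(order, (a, b) ->              // 4: re-sort by docID
>     Integer.compare(window[a].docId, window[b].docId));
> Object[][] extracted = new Object[window.length][]; // 5: temporary on-heap values
> for (int i : order) {                               //    iterators only move forward
>   extracted[i] = extractValues(window[i].docId);
> }
> for (Object[] row : extracted) {                    // 6+7: window order, deliver
>   deliver(row);
> }
> {code}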
> One big difference from the current export code is of course the need to hold
> the extracted values for a whole sliding window in memory at once. This might
> well be a showstopper, as there is no real limit to how large this partial
> result set can be. Maybe such an optimization could be requested explicitly
> by users who know that there is enough memory?
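> Back-of-envelope (my numbers): for a 30,000-document window, 4 long fields
> cost about 30,000 * 4 * 8 bytes ≈ 1 MB, which is harmless, but a string field
> averaging 1 KB per document already costs ~30 MB per window, multiplied by
> concurrent exports, and per-document value size has no upper bound.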