[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699385#comment-16699385 ]
Toke Eskildsen commented on SOLR-13013: --------------------------------------- [~joel.bernstein] I am glad that it looks useful. I expect that it needs at least a full re-implementation of the {{MapWriter}}-parts. I am unfamiliar with that part of the code, and it would be great if you took over. I won't take any offense if you rewrite everything. I'd be happy to try and review or at least do some testing on our 300M docs/segment shards as they are very affected by the DV API change. > Change export to extract DocValues in docID order > ------------------------------------------------- > > Key: SOLR-13013 > URL: https://issues.apache.org/jira/browse/SOLR-13013 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Export Writer > Affects Versions: 7.5, master (8.0) > Reporter: Toke Eskildsen > Priority: Major > Fix For: master (8.0) > > Attachments: SOLR-13013_proof_of_concept.patch, > SOLR-13013_proof_of_concept.patch > > > The streaming export writer uses a sliding window of 30,000 documents for > paging through the result set in a given sort order. Each time a window has > been calculated, the values for the export fields are retrieved from the > underlying DocValues structures in document sort order and delivered. > The iterative DocValues API introduced in Lucene/Solr 7 does not support > random access. The current export implementation bypasses this by creating a > new DocValues-iterator for each individual value to retrieve. This slows down > export as the iterator has to seek to the given docID from start for each > value. The slowdown scales with shard size (see LUCENE-8374 for details). An > alternative is to extract the DocValues in docID-order, with re-use of > DocValues-iterators. The idea is as follows: > # Change the FieldWriters for export to re-use the DocValues-iterators if > subsequent requests are for docIDs higher than the previous ones > # Calculate the sliding window of SortDocs as usual > # Take a note of the order of the SortDocs in the sliding window > # Re-sort the SortDocs in docID-order > # Extract the DocValues to a temporary on-heap structure > # Re-sort the extracted values to the original sliding window order > Deliver the values > One big difference from the current export code is of course the need to hold > the whole sliding window scaled result set in memory. This might well be a > showstopper as there is no real limit to how large this partial result set > can be. Maybe such an optimization could be requested explicitly if the user > knows that there is enough memory? -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org