[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982574#comment-16982574
 ] 

Jason Gerlowski edited comment on SOLR-13013 at 11/26/19 3:04 PM:
------------------------------------------------------------------

The latest patch supports three /export "methods", selected by the value of the 
(nocommit) {{solr13013}} param:

* fetch docValues out of order, i.e. the pre-existing implementation 
({{solr13013=3}})
* fetch docValues in docId order, using Toke's SortingMap-based implementation 
({{solr13013=1}})
* fetch docValues in docId order, using a bitmap-based implementation I have 
added.  More details below.  ({{solr13013=2}})

The new implementation I've added in this patch creates a bitmap to represent 
all docs, sets the bits corresponding to the docs in the current "window", and 
then iterates over the bitmap to fetch docValues in docId order.  This performs 
slightly worse than Toke's implementation but pays dividends in simplicity.  If 
we pursue this implementation, we can drop a handful of classes required by 
Toke's implementation: {{SortingMap}}, {{OrderWriter}}, {{Entry}}, {{MapEntry}}, 
{{ListEntry}}, {{DocEntryList}}, {{AtomicEntry}}, {{SortingWriter}}.

I've left Toke's original implementation in for now, though, as its slight 
performance edge may be worth the extra complexity.
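
To make the bitmap idea concrete, here is a minimal, self-contained sketch.  It 
is not code from the patch: the class and method names, the use of 
{{java.util.BitSet}}, and the {{ForwardOnlyValues}} stand-in for a docValues 
iterator are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; none of these names come from the SOLR-13013 patch.
final class BitmapWindowSketch {

    /** Hypothetical stand-in for a docValues iterator that can only move forward. */
    static final class ForwardOnlyValues {
        private final long[] values;
        private int lastDoc = -1;

        ForwardOnlyValues(long[] values) {
            this.values = values;
        }

        long valueFor(int docId) {
            if (docId <= lastDoc) {
                throw new IllegalStateException("iterator cannot move backwards to doc " + docId);
            }
            lastDoc = docId;
            return values[docId];
        }
    }

    /**
     * Fetches one value per doc in the window, reading in docId order,
     * and returns the values in the window's original (sort) order.
     */
    static long[] fetchWindow(int[] windowDocIds, int maxDoc, ForwardOnlyValues dv) {
        // One bit per doc in the index; set the bits for docs in the current window.
        BitSet bits = new BitSet(maxDoc);
        for (int docId : windowDocIds) {
            bits.set(docId);
        }

        // Walking the set bits visits the window's docs in ascending docId order,
        // so a single forward-only iterator can serve every lookup.
        Map<Integer, Long> valueByDoc = new HashMap<>();
        for (int docId = bits.nextSetBit(0); docId >= 0; docId = bits.nextSetBit(docId + 1)) {
            valueByDoc.put(docId, dv.valueFor(docId));
        }

        // Emit the values in the original window order.
        long[] out = new long[windowDocIds.length];
        for (int i = 0; i < windowDocIds.length; i++) {
            out[i] = valueByDoc.get(windowDocIds[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] values = {0, 10, 20, 30, 40, 50, 60, 70, 80, 90}; // value = docId * 10
        int[] window = {7, 2, 5};                                // docs in sort order
        long[] out = fetchWindow(window, values.length, new ForwardOnlyValues(values));
        System.out.println(Arrays.toString(out)); // prints [70, 20, 50]
    }
}
```

The sketch holds the fetched values in a map keyed by docId before writing them 
out in sort order; how the patch itself carries values back to sort order may 
well differ.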

Still to do:
* Decide what parameters we want to expose to users for switching between 
methods, or tuning window size
* Decide on an implementation.
* Randomize TestExportWriter testing to use different methods, and/or add 
more tests.





> Change export to extract DocValues in docID order
> -------------------------------------------------
>
>                 Key: SOLR-13013
>                 URL: https://issues.apache.org/jira/browse/SOLR-13013
>             Project: Solr
>          Issue Type: Improvement
>          Components: Export Writer
>    Affects Versions: 7.5, 8.0
>            Reporter: Toke Eskildsen
>            Assignee: Jason Gerlowski
>            Priority: Major
>         Attachments: SOLR-13013.patch, SOLR-13013.patch, 
> SOLR-13013_proof_of_concept.patch, SOLR-13013_proof_of_concept.patch
>
>
> The streaming export writer uses a sliding window of 30,000 documents for 
> paging through the result set in a given sort order. Each time a window has 
> been calculated, the values for the export fields are retrieved from the 
> underlying DocValues structures in document sort order and delivered.
> The iterative DocValues API introduced in Lucene/Solr 7 does not support 
> random access. The current export implementation bypasses this by creating a 
> new DocValues-iterator for each individual value to retrieve. This slows down 
> export as the iterator has to seek to the given docID from start for each 
> value. The slowdown scales with shard size (see LUCENE-8374 for details). An 
> alternative is to extract the DocValues in docID-order, with re-use of 
> DocValues-iterators. The idea is as follows:
>  # Change the FieldWriters for export to re-use the DocValues-iterators if 
> subsequent requests are for docIDs higher than the previous ones
>  # Calculate the sliding window of SortDocs as usual
>  # Take a note of the order of the SortDocs in the sliding window
>  # Re-sort the SortDocs in docID-order
>  # Extract the DocValues to a temporary on-heap structure
>  # Re-sort the extracted values to the original sliding window order
>  # Deliver the values
> One big difference from the current export code is of course the need to hold 
> the whole sliding window scaled result set in memory. This might well be a 
> showstopper as there is no real limit to how large this partial result set 
> can be. Maybe such an optimization could be requested explicitly if the user 
> knows that there is enough memory?
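
The re-sorting steps listed in the issue description above can be sketched 
roughly as follows.  This is an illustration under assumptions, not the code 
from any attached patch: {{ForwardReader}}, {{ArrayReader}} and 
{{extractInWindowOrder}} are hypothetical names, and a plain array stands in 
for the real DocValues structures.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only; the names here are hypothetical.
final class DocIdOrderSketch {

    /** Hypothetical reader that must be called with strictly increasing docIDs. */
    interface ForwardReader {
        long valueFor(int docId);
    }

    /** Array-backed reader that fails if asked to seek backwards. */
    static final class ArrayReader implements ForwardReader {
        private final long[] values;
        private int lastDoc = -1;

        ArrayReader(long[] values) {
            this.values = values;
        }

        @Override
        public long valueFor(int docId) {
            if (docId <= lastDoc) {
                throw new IllegalStateException("backwards seek to doc " + docId);
            }
            lastDoc = docId;
            return values[docId];
        }
    }

    /** Returns one value per window entry, in the window's original sort order. */
    static long[] extractInWindowOrder(int[] windowDocIds, ForwardReader reader) {
        // Take a note of the original order: positions[i] is a slot in the window.
        Integer[] positions = new Integer[windowDocIds.length];
        for (int i = 0; i < positions.length; i++) {
            positions[i] = i;
        }

        // Re-sort the window slots by the docID each slot refers to.
        Arrays.sort(positions, Comparator.comparingInt(p -> windowDocIds[p]));

        // Extract in docID order into an on-heap array. Writing each value straight
        // into its original slot restores the sort order without a second sort.
        long[] values = new long[windowDocIds.length];
        for (int p : positions) {
            values[p] = reader.valueFor(windowDocIds[p]);
        }
        return values;
    }

    public static void main(String[] args) {
        long[] stored = {100, 101, 102, 103, 104, 105};
        int[] window = {4, 1, 3}; // docIDs in the requested sort order
        long[] out = extractInWindowOrder(window, new ArrayReader(stored));
        System.out.println(Arrays.toString(out)); // prints [104, 101, 103]
    }
}
```

Note that the {{values}} array is the temporary on-heap structure the 
description worries about: it holds one entry per document in the window.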


