[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701727#comment-16701727
 ] 

Toke Eskildsen commented on SOLR-13013:
---------------------------------------

I cherry-picked some DENSE fields from our netarchive index and tried exporting 
them from a single shard, to demonstrate the problem with large indexes in 
Lucene/Solr 7+ and to performance test the current patch.

I made sure everything was warmed (practically zero IO on the index-SSD 
according to iostat) and tested with combinations of SOLR-13013 and LUCENE-8374 
turned on and off:
{code}
> curl -s "http://localhost:9090/solr/ns80/select?q=*:*" | jq .response.numFound
307171504

> curl -s "http://localhost:9090/solr/ns80/select?q=text:hestevogn" | jq .response.numFound
52654

> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_type_ext,content_type_served,crawl_date,content_length&solr13013=true&lucene8374=true" -o t_export_true_true
0.433661 seconds

> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_type_ext,content_type_served,crawl_date,content_length&solr13013=true&lucene8374=false" -o t_export_true_false
0.555844 seconds

> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_type_ext,content_type_served,crawl_date,content_length&solr13013=false&lucene8374=true" -o t_export_false_true
1.037004 seconds

> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_type_ext,content_type_served,crawl_date,content_length&solr13013=false&lucene8374=false" -o t_export_false_false
843.477925 seconds

> diff -s t_export_true_true t_export_true_false ; diff -s t_export_true_true t_export_false_true ; diff -s t_export_true_true t_export_false_false
Files t_export_true_true and t_export_true_false are identical
Files t_export_true_true and t_export_false_true are identical
Files t_export_true_true and t_export_false_false are identical
{code}
Observations from this ad-hoc test (which of course should be independently 
verified): Exporting from a large index with vanilla Solr master is not ideal. 
It does not make much sense to talk about the performance factors the patches 
provide, as they are mostly about changing time complexity: Our factor 1500 
speed-up from SOLR-13013 with this shard and this request will be something 
quite different for other setups.
 * The explicit sort in SOLR-13013 seems the superior solution and the addition 
of the O(n) → O(1) lookup-improvement in LUCENE-8374 only makes it slightly 
faster.
 * On the other hand, LUCENE-8374 works quite well for export and does not 
require any changes to export. This might influence whether or not energy 
should be spent on a "best as possible" fallback in case of memory problems, or 
whether a simpler "full fallback to sliding window sort order" is preferable.
 * On the gripping hand, testing with a smaller index is likely to result in 
SOLR-13013 being (relative to LUCENE-8374) even faster, as SOLR-13013 avoids 
re-opening DV-readers all the time. More testing needed (no surprise there).

> Change export to extract DocValues in docID order
> -------------------------------------------------
>
>                 Key: SOLR-13013
>                 URL: https://issues.apache.org/jira/browse/SOLR-13013
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Export Writer
>    Affects Versions: 7.5, master (8.0)
>            Reporter: Toke Eskildsen
>            Priority: Major
>             Fix For: master (8.0)
>
>         Attachments: SOLR-13013_proof_of_concept.patch, 
> SOLR-13013_proof_of_concept.patch
>
>
> The streaming export writer uses a sliding window of 30,000 documents for 
> paging through the result set in a given sort order. Each time a window has 
> been calculated, the values for the export fields are retrieved from the 
> underlying DocValues structures in document sort order and delivered.
> The iterative DocValues API introduced in Lucene/Solr 7 does not support 
> random access. The current export implementation bypasses this by creating a 
> new DocValues-iterator for each individual value to retrieve. This slows down 
> export as the iterator has to seek to the given docID from start for each 
> value. The slowdown scales with shard size (see LUCENE-8374 for details). An 
> alternative is to extract the DocValues in docID-order, with re-use of 
> DocValues-iterators. The idea is as follows:
>  # Change the FieldWriters for export to re-use the DocValues-iterators if 
> subsequent requests are for docIDs higher than the previous ones
>  # Calculate the sliding window of SortDocs as usual
>  # Take a note of the order of the SortDocs in the sliding window
>  # Re-sort the SortDocs in docID-order
>  # Extract the DocValues to a temporary on-heap structure
>  # Re-sort the extracted values to the original sliding window order
>  # Deliver the values
> One big difference from the current export code is of course the need to hold 
> the whole sliding window scaled result set in memory. This might well be a 
> showstopper as there is no real limit to how large this partial result set 
> can be. Maybe such an optimization could be requested explicitly if the user 
> knows that there is enough memory?
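
The docID-order extraction described in the issue can be sketched in a few 
lines of Python. This is a toy model under stated assumptions, not the patch 
itself: ForwardOnlyDocValues is a hypothetical stand-in for a forward-only 
DocValues iterator (it is not the Lucene API), the window is a plain list of 
docIDs in sort order, and a single field is exported.

```python
from typing import List, Sequence


class ForwardOnlyDocValues:
    """Hypothetical stand-in for a forward-only DocValues iterator."""

    def __init__(self, values: Sequence[int]) -> None:
        self._values = values
        self._doc_id = -1  # positioned before the first document

    def advance_exact(self, target: int) -> int:
        # Like the iterative API, moving backwards is not allowed.
        if target < self._doc_id:
            raise ValueError("forward-only iterator cannot move backwards")
        self._doc_id = target
        return self._values[target]


def export_window(window_doc_ids: List[int], column: Sequence[int]) -> List[int]:
    """Return the field values for one sliding window, in window sort order."""
    # Note the position of each document in the sort-ordered window,
    # then order those positions by docID.
    positions_by_doc_id = sorted(
        range(len(window_doc_ids)), key=lambda pos: window_doc_ids[pos]
    )

    # One iterator is re-used for the whole window, since docIDs only increase.
    iterator = ForwardOnlyDocValues(column)

    # Temporary on-heap structure holding the extracted values.
    extracted = [0] * len(window_doc_ids)
    for pos in positions_by_doc_id:
        # Writing to the remembered position restores the window's sort order.
        extracted[pos] = iterator.advance_exact(window_doc_ids[pos])
    return extracted


# Example: docID d holds the value 10 * d; the window is in sort order.
column = [10 * d for d in range(8)]
print(export_window([5, 2, 7, 0], column))  # [50, 20, 70, 0]
```

The extracted list is the part that has to fit in memory: it grows with the 
window size times the number of exported fields.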



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

