[ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701727#comment-16701727 ]
Toke Eskildsen commented on SOLR-13013:
---------------------------------------

I cherry-picked some DENSE fields from our netarchive index and tried exporting them from a single shard, to demonstrate the problem with large indexes in Lucene/Solr 7+ and to performance test the current patch. I made sure everything was warmed (practically zero IO on the index-SSD according to iostat) and tested with combinations of SOLR-13013 and LUCENE-8374 turned on and off:

{code}
> curl -s "http://localhost:9090/solr/ns80/select?q=*:*" | jq .response.numFound
307171504
> curl -s "http://localhost:9090/solr/ns80/select?q=text:hestevogn" | jq .response.numFound
52654
> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_type_ext,content_type_served,crawl_date,content_length&solr13013=true&lucene8374=true" -o t_export_true_true
0.433661 seconds
> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_type_ext,content_type_served,crawl_date,content_length&solr13013=true&lucene8374=false" -o t_export_true_false
0.555844 seconds
> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_type_ext,content_type_served,crawl_date,content_length&solr13013=false&lucene8374=true" -o t_export_false_true
1.037004 seconds
> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn&sort=id+asc&fl=content_type_ext,content_type_served,crawl_date,content_length&solr13013=false&lucene8374=false" -o t_export_false_false
843.477925 seconds
> diff -s t_export_true_true t_export_true_false ; diff -s t_export_true_true t_export_false_true ; diff -s t_export_true_true t_export_false_false
Files t_export_true_true and t_export_true_false are identical
Files t_export_true_true and t_export_false_true are identical
Files t_export_true_true and t_export_false_false are identical
{code}

Observations from this ad-hoc test (which of course should be independently verified):

Exporting from a large index with vanilla Solr master is not ideal: with both patches turned off, the export of 52,654 documents took 843 seconds. It does not make much sense to quote speed-up factors for the patches, as they mostly change the time complexity: our factor 1500 speed-up with SOLR-13013 for this shard and this request will be something quite different for other setups.

* The explicit sort in SOLR-13013 seems the superior solution, and the addition of the O(n) → O(1) lookup-improvement in LUCENE-8374 only makes it slightly faster.
* On the other hand, LUCENE-8374 works quite well for export and does not require any changes to export. This might influence whether energy should be spent on a "best as possible" fallback in case of memory problems, or if a simpler "full fallback to sliding window sort order" is preferable.
* On the gripping hand, testing with a smaller index is likely to result in SOLR-13013 being (relative to LUCENE-8374) even faster, as SOLR-13013 avoids re-opening DV-readers all the time.

More testing needed (no surprise there).

> Change export to extract DocValues in docID order
> -------------------------------------------------
>
> Key: SOLR-13013
> URL: https://issues.apache.org/jira/browse/SOLR-13013
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Export Writer
> Affects Versions: 7.5, master (8.0)
> Reporter: Toke Eskildsen
> Priority: Major
> Fix For: master (8.0)
>
> Attachments: SOLR-13013_proof_of_concept.patch, SOLR-13013_proof_of_concept.patch
>
>
> The streaming export writer uses a sliding window of 30,000 documents for
> paging through the result set in a given sort order. Each time a window has
> been calculated, the values for the export fields are retrieved from the
> underlying DocValues structures in document sort order and delivered.
> The iterative DocValues API introduced in Lucene/Solr 7 does not support
> random access. The current export implementation works around this by
> creating a new DocValues-iterator for each individual value to retrieve. This
> slows down export, as the iterator has to seek to the given docID from the
> start for each value. The slowdown scales with shard size (see LUCENE-8374
> for details). An alternative is to extract the DocValues in docID-order, with
> re-use of DocValues-iterators. The idea is as follows:
> # Change the FieldWriters for export to re-use the DocValues-iterators if
> subsequent requests are for docIDs higher than the previous ones
> # Calculate the sliding window of SortDocs as usual
> # Take a note of the order of the SortDocs in the sliding window
> # Re-sort the SortDocs in docID-order
> # Extract the DocValues to a temporary on-heap structure
> # Re-sort the extracted values to the original sliding window order
> # Deliver the values
> One big difference from the current export code is of course the need to hold
> the extracted values for the whole sliding window in memory. This might well
> be a showstopper, as there is no real limit to how large this partial result
> set can be. Maybe such an optimization could be requested explicitly if the
> user knows that there is enough memory?
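
The re-sorting steps listed in the issue description can be illustrated with a small, self-contained Java sketch. It is not the code from the attached patch and uses no Lucene or Solr classes: `DocIdOrderExport`, `ForwardOnlyValues` and `extractInWindowOrder` are hypothetical names, and `ForwardOnlyValues` is only a stand-in for a forward-only DocValues iterator. The sketch also assumes a single numeric field, and it folds the "re-sort to the original window order" step into the extraction by writing each value directly to its original window position.

```java
import java.util.Arrays;
import java.util.Comparator;

final class DocIdOrderExport {

    /** Hypothetical stand-in for a forward-only DocValues iterator over one field. */
    static final class ForwardOnlyValues {
        private final long[] values; // one value per docID, assumed dense
        private int current = -1;

        ForwardOnlyValues(long[] values) {
            this.values = values;
        }

        /** Returns the value for docID; moving backwards is not allowed. */
        long advanceExact(int docID) {
            if (docID < current) {
                throw new IllegalStateException(
                        "cannot go back from docID " + current + " to " + docID);
            }
            current = docID;
            return values[docID];
        }
    }

    /**
     * Extracts the values for the docIDs in the window using a single iterator,
     * returning them in the window's original (sort) order.
     */
    static long[] extractInWindowOrder(int[] windowDocIDs, long[] columnValues) {
        int n = windowDocIDs.length;

        // Note the original window positions, then order them by docID.
        Integer[] positions = new Integer[n];
        for (int i = 0; i < n; i++) {
            positions[i] = i;
        }
        Arrays.sort(positions, Comparator.comparingInt(i -> windowDocIDs[i]));

        // Visit the docIDs in increasing order so one iterator can be re-used,
        // storing each value at its original window position (on-heap buffer).
        ForwardOnlyValues iterator = new ForwardOnlyValues(columnValues);
        long[] out = new long[n];
        for (int pos : positions) {
            out[pos] = iterator.advanceExact(windowDocIDs[pos]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] column = {10, 11, 12, 13, 14, 15}; // value for docID i is 10 + i
        int[] window = {4, 1, 5, 0};              // docIDs in sort order
        System.out.println(Arrays.toString(extractInWindowOrder(window, column)));
        // prints [14, 11, 15, 10]
    }
}
```

The `out` array is the memory cost mentioned in the description: one buffered value per document in the window and per exported field, which grows with the size of the field values.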