[
https://issues.apache.org/jira/browse/SOLR-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17356406#comment-17356406
]
Jan Høydahl commented on SOLR-14608:
------------------------------------
Hi, what is the status of this JIRA? I see commits and open git branches but
this is not resolved?
Can you delete the central git branches when done?
https://github.com/apache/solr/tree/jira/SOLR-14608-export and
https://github.com/apache/solr/tree/jira/SOLR-14608-export-merge
> Faster sorting for the /export handler
> --------------------------------------
>
> Key: SOLR-14608
> URL: https://issues.apache.org/jira/browse/SOLR-14608
> Project: Solr
> Issue Type: New Feature
> Affects Versions: main (9.0)
> Reporter: Joel Bernstein
> Assignee: Joel Bernstein
> Priority: Major
> Fix For: main (9.0)
>
>
> The largest cost of the export handler is the sorting. This ticket will
> implement an improved algorithm for sorting that should greatly increase
> overall throughput for the export handler.
> *The current algorithm is as follows:*
> Collect a bitset of matching docs. Iterate over that bitset and materialize
> the top level oridinals for the sort fields in the document and add them to
> priority queue of size 30000. Then export the top 30000 docs, turn off the
> bits in the bit set and iterate again until all docs are sorted and sent.
> There are two performance bottlenecks with this approach:
> 1) Materializing the top level ordinals adds a huge amount of overhead to the
> sorting process.
> 2) The size of priority queue, 30,000, adds significant overhead to sorting
> operations.
> *The new algorithm:*
> Has a top level *merge sort iterator* that wraps segment level iterators that
> perform segment level priority queue sorts.
> *Segment level:*
> The segment level docset will be iterated and the segment level ordinals for
> the sort fields will be materialized and added to a segment level priority
> queue. As the segment level iterator pops docs from the priority queue the
> top level ordinals for the sort fields are materialized. Because the top
> level ordinals are materialized AFTER the sort, they only need to be looked
> up when the segment level ordinal changes. This takes advantage of the sort
> to limit the lookups into the top level ordinal structures. This also
> eliminates redundant lookups of top level ordinals that occur during the
> multiple passes over the matching docset.
> The segment level priority queues can be kept smaller than 30,000 to improve
> performance of the sorting operations because the overall batch size will
> still be 30,000 or greater when all the segment priority queue sizes are
> added up. This allows for batch sizes much larger then 30,000 without using a
> single large priority queue. The increased batch size means fewer iterations
> over the matching docset and the decreased priority queue size means faster
> sorting operations.
> *Top level:*
> A top level iterator does a merge sort over the segment level iterators by
> comparing the top level ordinals materialized when the segment level docs are
> popped from the segment level priority queues. This requires no extra memory
> and will be very performant.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]