[ 
https://issues.apache.org/jira/browse/SOLR-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-14608:
----------------------------------
    Description: 
The largest cost of the export handler is the sorting. This ticket will 
implement an improved algorithm for sorting that should greatly improve overall 
throughput for the export handler.

*The currently algorithm is as follows:*

Collect a bitset of match docs. Iterate over that bitset and materialize the 
top level oridinals for the sort fields in the document and add them to 
priority queue of size 30000. Then export the top 30000 docs, turn off the bits 
in the bit set and iterate again until all docs are sorted and sent. 

There are two performance bottlenecks with this approach:

1) Materializing the top level ordinals adds a huge amount of overhead to the 
sorting process.

2) The size of priority, 30,000 adds significant overhead.

*The new algorithm:*

Calls for a top level merge sort iterator that wraps segment level iterators 
that perform segment level priority queue sorts.

*At the segment level:*

The segment level docset will be iterated and the segment level ordinals for 
the sort fields will be materialized and added to a segment level priority 
queue. As the segment level iterator pops docs from the queue the top level 
ordinals for the sort fields are materialized. Because the top level ordinals 
are materialized AFTER the sort, they only need to be looked up when the 
segment level ordinal changes. This takes advantage of the sort to limit the 
lookups into the top level ordinal structures.

The segment level priority queues can be kept smaller than 30,000 to improve 
performance of the sorting operations because the overall batchsize will still 
be 30,000 or greater when all the segment priority queue sizes are added up.

*At the top level:*

A top level iterator does a merge sort over the segment level iterators by 
comparing the top level ordinals materialized when the segment level docs are 
popped from the segment level priority queues. This requires no extra memory 
and will be very performant.

 

 

 

 

 

Sorting algorithm to come shortly...

  was:
The largest cost of the export handler is the sorting. This ticket will 
implement an improved algorithm for sorting that should greatly improve overall 
throughput for the export handler.

*The currently algorithm is as follows:*

Collect a bitset of match docs. Iterate over that bitset and materialize the 
top level oridinals for the sort fields in the document and add them to 
priority queue of size 30000. Then export the top 30000 docs, turn off the bits 
in the bit set and iterate again until all docs are sorted and sent. 

There are two performance bottlenecks with this approach:

1) Materializing the top level ordinals adds a huge amount of overhead to the 
sorting process.

2) The size of priority, 30,000 adds significant overhead.

*The new algorithm:*

Calls for a top level merge sort iterator that wraps segment level iterators 
that perform segment level priority queue sorts.

*At the segment level:*

The segment level docset will be iterated and the segment level ordinals for 
the sort fields will be materialized and added to a segment level priority 
queue. As the segment level iterator pops docs from the queue the top level 
ordinals for the sort fields are materialized. Because the top level ordinals 
are materialized AFTER the sort, they only need to be looked up when the 
segment level ordinal changes. This takes advantage of the sort to limit the 
lookups into the top level ordinal structures.

The segment level priority queues can be kept smaller than 30,000 to improve 
performance of the sorting operations because the overall batchsize will still 
be 30,000 or greater when all the segments priority queue sizes are added up.

*At the top level:*

A top level iterator does a merge sort over the segment level iterators by 
comparing the top level ordinals materialized when the segment level docs are 
popped from the segment level priority queues. This requires no extra memory 
and will be very performant.

 

 

 

 

 

Sorting algorithm to come shortly...


> Faster sorting for the /export handler
> --------------------------------------
>
>                 Key: SOLR-14608
>                 URL: https://issues.apache.org/jira/browse/SOLR-14608
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Joel Bernstein
>            Priority: Major
>
> The largest cost of the export handler is the sorting. This ticket will 
> implement an improved algorithm for sorting that should greatly improve 
> overall throughput for the export handler.
> *The currently algorithm is as follows:*
> Collect a bitset of match docs. Iterate over that bitset and materialize the 
> top level oridinals for the sort fields in the document and add them to 
> priority queue of size 30000. Then export the top 30000 docs, turn off the 
> bits in the bit set and iterate again until all docs are sorted and sent. 
> There are two performance bottlenecks with this approach:
> 1) Materializing the top level ordinals adds a huge amount of overhead to the 
> sorting process.
> 2) The size of priority, 30,000 adds significant overhead.
> *The new algorithm:*
> Calls for a top level merge sort iterator that wraps segment level iterators 
> that perform segment level priority queue sorts.
> *At the segment level:*
> The segment level docset will be iterated and the segment level ordinals for 
> the sort fields will be materialized and added to a segment level priority 
> queue. As the segment level iterator pops docs from the queue the top level 
> ordinals for the sort fields are materialized. Because the top level ordinals 
> are materialized AFTER the sort, they only need to be looked up when the 
> segment level ordinal changes. This takes advantage of the sort to limit the 
> lookups into the top level ordinal structures.
> The segment level priority queues can be kept smaller than 30,000 to improve 
> performance of the sorting operations because the overall batchsize will 
> still be 30,000 or greater when all the segment priority queue sizes are 
> added up.
> *At the top level:*
> A top level iterator does a merge sort over the segment level iterators by 
> comparing the top level ordinals materialized when the segment level docs are 
> popped from the segment level priority queues. This requires no extra memory 
> and will be very performant.
>  
>  
>  
>  
>  
> Sorting algorithm to come shortly...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to