[ 
https://issues.apache.org/jira/browse/SOLR-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263536#comment-17263536
 ] 

ASF subversion and git services commented on SOLR-14608:
--------------------------------------------------------

Commit 4f691b8bb4492bec44440c1db65cb45ab83bec1c in lucene-solr's branch 
refs/heads/jira/SOLR-14608-export-merge from Joel Bernstein
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4f691b8 ]

SOLR-14608: Faster sorting for the /export handler
Squashed commit of the following:

commit 66f85c550691fcb3b4ee1959c7ed1695aed3cb78
Author: Joel Bernstein <jbern...@apache.org>
Date:   Thu Jan 7 11:04:18 2021 -0500

    SOLR-14608: Fix tie-break in SortDoc, TestExportWriter now passing.

commit a183d29ea84dbd4a76517f27a87ef78d56bc935f
Author: Joel Bernstein <jbern...@apache.org>
Date:   Tue Jan 5 16:07:18 2021 -0500

    SOLR-14608: Fix failing TestExportWriter tests

commit d7e81e8197a4b9e06c8d387d34e941a1365ad163
Author: Joel Bernstein <jbern...@apache.org>
Date:   Tue Dec 29 12:34:02 2020 -0500

    SOLR-14608: Tone down debug logging

commit cae61336f86295014a1c373774bc56c5b9f21670
Author: Joel Bernstein <jbern...@apache.org>
Date:   Tue Dec 29 09:40:35 2020 -0500

    SOLR-14608: Fix nanoTime to millis calculation and more code cleanup

commit 4e2cd9aaeaeeeed835af6b638faeefab231d7b9a
Author: Joel Bernstein <jbern...@apache.org>
Date:   Mon Dec 28 15:00:12 2020 -0500

    SOLR-14608: Code clean up

commit 894141b3c9461880917de471285d3b8a96c0a4fe
Author: Joel Bernstein <jbern...@apache.org>
Date:   Sun Dec 27 13:56:25 2020 -0500

    SOLR-14608: Fix bug when caching docvalues objects related to the leafreader ord

commit 0a7ea0ef20d7b280b7d4381ad851024096addf27
Author: Joel Bernstein <jbern...@apache.org>
Date:   Wed Dec 23 16:12:59 2020 -0500

    SOLR-14608: Reuse docvalues when possible

commit 6af848b086c2002b031ea159e485f4b2f30df7c0
Author: Joel Bernstein <jbern...@apache.org>
Date:   Mon Dec 21 14:13:31 2020 -0500

    SOLR-14608: Suppress Broken pipe logging

commit f40001700778bcac4390b2f00c014ad0bf19d091
Author: Joel Bernstein <jbern...@apache.org>
Date:   Sun Dec 6 10:36:01 2020 -0500

    Test commit 2

commit 8373f3a6e1383028a517831bc561ac2f491c6ce3
Author: Joel Bernstein <jbern...@apache.org>
Date:   Sun Dec 6 10:34:26 2020 -0500

    Test commit

commit 8e9a7afddde80080150e3fd078005a8add29ac7f
Author: Andrzej Bialecki <a...@apache.org>
Date:   Thu Jul 30 15:48:28 2020 +0200

    SOLR-14608: More cleanups. Fix a bug in compareTo. Add SortDoc.equals() / hashCode().

commit 536d962d6e016573cafd2f420511e4f7083e0468
Author: Andrzej Bialecki <a...@apache.org>
Date:   Wed Jul 29 13:13:09 2020 +0200

    SOLR-14608: Fix generics / raw types, move around the timer metrics so that
    they make sense.

commit ebd5bcaab8c917b16164bcd059276558002b09a6
Author: Andrzej Bialecki <a...@apache.org>
Date:   Tue Jul 28 11:38:44 2020 +0200

    SOLR-14608: Fix code formatting.

commit bf8d954ca1289d82eb5334719fb97bbabacacb09
Merge: b610ddae2f8 6bf5f4a87f4
Author: Andrzej Bialecki <a...@apache.org>
Date:   Mon Jul 27 15:42:04 2020 +0200

    Merge branch 'master' into jira/SOLR-14608-export

commit b610ddae2f8a4258f1d6e7c842f480f2b8c46fa9
Author: Joel Bernstein <jbern...@apache.org>
Date:   Fri Jul 17 10:32:27 2020 -0400

    SOLR-14608: Cache output bytesref

commit ee9c3d083c60850c3e48d301356329cf0f017c86
Author: Joel Bernstein <jbern...@apache.org>
Date:   Wed Jul 15 15:55:48 2020 -0400

    SOLR-14608: Works with one string value sort field.

commit 32e92c5025637e4b6cd940628a491fe6b7e7cb13
Author: Joel Bernstein <jbern...@apache.org>
Date:   Mon Jul 13 11:26:56 2020 -0400

    SOLR-14608: Wire-up the MergeIterator part three

commit f747562ca60907dd542cd9bc141a5806f156db1c
Author: Joel Bernstein <jbern...@apache.org>
Date:   Mon Jul 13 10:41:39 2020 -0400

    SOLR-14608: Wire-up the MergeIterator part two

commit 970c6cf4f5abad4248ff5c686c8ec65061dfa949
Author: Joel Bernstein <jbern...@apache.org>
Date:   Mon Jul 13 09:32:52 2020 -0400

    SOLR-14608: Wire-up the MergeIterator

commit 95e706abc425003d79a037500b9887f2c8a7798c
Author: Joel Bernstein <jbern...@apache.org>
Date:   Fri Jul 10 09:53:22 2020 -0400

    SOLR-14608: Size segment level sort queues based on segment maxdoc

commit e0fc38f1b1093cd761da03e561df6395d3a79fc1
Author: Joel Bernstein <jbern...@apache.org>
Date:   Thu Jul 9 16:40:38 2020 -0400

    SOLR-14608: Add method for creating the MergeIterator

commit 9b01320ddd3800607fa0197df6ac66bfd27e148a
Author: Joel Bernstein <jbern...@apache.org>
Date:   Thu Jul 9 14:23:17 2020 -0400

    SOLR-14608: Add skeleton algorithm for segment level iterator

commit bb4ae51c1c3e54c976bd1d449d5264afa3d74ec2
Author: Joel Bernstein <jbern...@apache.org>
Date:   Thu Jul 9 13:04:39 2020 -0400

    SOLR-14608: Add basic top level merge sort iterator


> Faster sorting for the /export handler
> --------------------------------------
>
>                 Key: SOLR-14608
>                 URL: https://issues.apache.org/jira/browse/SOLR-14608
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>            Priority: Major
>
> The largest cost of the export handler is the sorting. This ticket will 
> implement an improved algorithm for sorting that should greatly increase 
> overall throughput for the export handler.
> *The current algorithm is as follows:*
> Collect a bitset of matching docs. Iterate over that bitset, materialize
> the top level ordinals for the sort fields of each document, and add them to
> a priority queue of size 30,000. Then export the top 30,000 docs, turn off
> their bits in the bitset, and iterate again until all docs are sorted and
> sent. There are two performance bottlenecks with this approach:
> 1) Materializing the top level ordinals adds a huge amount of overhead to the
> sorting process.
> 2) The size of the priority queue, 30,000, adds significant overhead to
> sorting operations.
> *The new algorithm:*
> A top level *merge sort iterator* wraps segment level iterators that
> perform segment level priority queue sorts.
> *Segment level:*
> The segment level docset will be iterated and the segment level ordinals for 
> the sort fields will be materialized and added to a segment level priority 
> queue. As the segment level iterator pops docs from the priority queue the 
> top level ordinals for the sort fields are materialized. Because the top 
> level ordinals are materialized AFTER the sort, they only need to be looked 
> up when the segment level ordinal changes. This takes advantage of the sort 
> to limit the lookups into the top level ordinal structures. This also 
> eliminates redundant lookups of top level ordinals that occur during the 
> multiple passes over the matching docset.
> The segment level priority queues can be kept smaller than 30,000 to improve 
> performance of the sorting operations because the overall batch size will 
> still be 30,000 or greater when all the segment priority queue sizes are 
> added up. This allows for batch sizes much larger than 30,000 without using a
> single large priority queue. The increased batch size means fewer iterations 
> over the matching docset and the decreased priority queue size means faster 
> sorting operations.
> *Top level:*
> A top level iterator does a merge sort over the segment level iterators by 
> comparing the top level ordinals materialized when the segment level docs are 
> popped from the segment level priority queues. This requires no extra memory 
> and will be very performant.
>  
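
The two levels described above can be sketched roughly as follows. This is
an illustration only, not the code from the commit: the class and field
names (ExportMergeSketch, SegmentIterator, SortedDoc, segToTop, lookups) are
hypothetical, a plain long[] stands in for the top level ordinal lookup, and
each segment is sorted in a single unbounded queue rather than in bounded
batches. It assumes one sort field, ascending order, and a segment-to-top
level ordinal mapping that preserves order within a segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of the two-level sort; not the code from the commit. */
final class ExportMergeSketch {

  /** A doc popped from a segment queue, carrying its top level ordinal. */
  static final class SortedDoc {
    final int segment;
    final int docId;
    final long topOrd;

    SortedDoc(int segment, int docId, long topOrd) {
      this.segment = segment;
      this.docId = docId;
      this.topOrd = topOrd;
    }

    @Override
    public String toString() {
      return segment + ":" + docId;
    }
  }

  /** Sorts one segment by segment ordinal; maps to top level ordinals on pop. */
  static final class SegmentIterator implements Iterator<SortedDoc> {
    private final int segment;
    private final long[] segToTop; // assumed stand-in for the ordinal lookup
    private final PriorityQueue<int[]> queue; // entries are {segmentOrd, docId}
    private int lastSegOrd = -1;
    private long lastTopOrd = -1;
    int lookups = 0; // counts top level lookups, for illustration only

    SegmentIterator(int segment, int[] segOrdByDoc, long[] segToTop) {
      this.segment = segment;
      this.segToTop = segToTop;
      this.queue = new PriorityQueue<>(
          Comparator.<int[]>comparingInt(e -> e[0]).thenComparingInt(e -> e[1]));
      for (int doc = 0; doc < segOrdByDoc.length; doc++) {
        queue.add(new int[] {segOrdByDoc[doc], doc});
      }
    }

    @Override
    public boolean hasNext() {
      return !queue.isEmpty();
    }

    @Override
    public SortedDoc next() {
      int[] entry = queue.remove(); // throws NoSuchElementException when empty
      // Docs arrive in segment ordinal order, so the top level ordinal only
      // needs to be looked up when the segment ordinal changes.
      if (entry[0] != lastSegOrd) {
        lastSegOrd = entry[0];
        lastTopOrd = segToTop[entry[0]];
        lookups++;
      }
      return new SortedDoc(segment, entry[1], lastTopOrd);
    }
  }

  /** Merges the segment iterators by comparing the top level ordinals of their heads. */
  static List<SortedDoc> mergeAll(List<SegmentIterator> segments) {
    SortedDoc[] heads = new SortedDoc[segments.size()];
    for (int i = 0; i < heads.length; i++) {
      heads[i] = segments.get(i).hasNext() ? segments.get(i).next() : null;
    }
    List<SortedDoc> out = new ArrayList<>();
    while (true) {
      int best = -1;
      // Linear scan over one head per segment; ties go to the lower segment.
      for (int i = 0; i < heads.length; i++) {
        if (heads[i] != null && (best < 0 || heads[i].topOrd < heads[best].topOrd)) {
          best = i;
        }
      }
      if (best < 0) {
        return out;
      }
      out.add(heads[best]);
      heads[best] = segments.get(best).hasNext() ? segments.get(best).next() : null;
    }
  }

  public static void main(String[] args) {
    SegmentIterator s0 = new SegmentIterator(0, new int[] {2, 0, 2, 1}, new long[] {0, 3, 5});
    SegmentIterator s1 = new SegmentIterator(1, new int[] {1, 0, 1}, new long[] {1, 3});
    System.out.println(mergeAll(Arrays.asList(s0, s1)));
  }
}
```

With the sample data in main, the merged order printed as segment:doc pairs
is [0:1, 1:1, 0:3, 1:0, 1:2, 0:0, 0:2], and the seven docs need only five
top level lookups (three in segment 0, two in segment 1). The merge keeps
just one head per segment, which is the sense in which the top level needs
no additional sort buffer in this sketch.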



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
