mustafasrepo commented on issue #8572:
URL:
https://github.com/apache/arrow-datafusion/issues/8572#issuecomment-1860316023
At the time of the original PR, we did a benchmark for comparison with
following results.
<h1>PLAN V1</h1>
<pre><code class="language-rust">"SortPreservingMergeExec: [a@0 ASC
NULLS LAST]",
" SortPreservingRepartitionExec: partitioning=Hash([b@0], 16),
input_partitions=16",
" RepartitionExec: partitioning=Hash([a@0], 16),
input_partitions=1",
" MemoryExec: partitions=1,
partition_sizes=[<depends_on_batch_size>], output_ordering:
[PhysicalSortExpr { expr: Column { name: \\"a\\", index: 0 },
options: SortOptions { descending: false, nulls_first: false } }]",
</code></pre>
<h1>PLAN V2</h1>
<pre><code class="language-rust">"SortPreservingMergeExec: [a@0 ASC
NULLS LAST]",
" SortExec: expr=[a@0 ASC NULLS LAST]",
" RepartitionExec: partitioning=Hash([b@0], 16),
input_partitions=16",
" RepartitionExec: partitioning=Hash([a@0], 16),
input_partitions=1",
" MemoryExec: partitions=1,
partition_sizes=[(<depends_on_batch_size>)], output_ordering:
[PhysicalSortExpr { expr: Column { name: \\"a\\", index: 0 },
options: SortOptions { descending: false, nulls_first: false } }]",
</code></pre>
n_elem | n_trial | batch_size | elapsed_mean(v1) | elapsed_median(v1) |
elapsed_mean(v2) | elapsed_median(v2)
-- | -- | -- | -- | -- | -- | --
100000 | 10 | 25 | 581.791054ms | 581.777708ms | 1.112292862s | 1.117181625s
100000 | 10 | 50 | 373.18927ms | 373.231708ms | 385.533966ms | 385.685291ms
100000 | 10 | 100 | 240.91572ms | 240.570833ms | 239.308708ms | 239.303709ms
100000 | 10 | 1000 | 115.471541ms | 113.817709ms | 100.713324ms |
100.647333ms
<!-- notionvc: 5eca7a10-df76-40ba-81f3-a0e27c688240 -->
After these results, we decided to use existing default behavior (where V2
is preferred over V1) to not hurt performance for others. Frankly, changing
default would be better for our use cases. As @alamb suggests if there is
concencus we can change the default. I wonder @ozankabak's opinion regarding
this change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]