jaylmiller commented on issue #5230: URL: https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1453611831
> > like > 512
>
> How common are such batches in practice? I guess I'm wondering if the added complexity is justified for what is effectively a degenerate case that will cause issues far beyond just for sort?
>
> _Btw DynComparator has known issues w.r.t sorting nulls, and I had hoped to eventually deprecate and remove it_ - [apache/arrow-rs#2687](https://github.com/apache/arrow-rs/issues/2687)

No, 512 is way too small, @tustvold.

For the sort bench, we see a regression when the `execute` call sorts a single batch of 12500 rows (total benchmark input is 100000 rows, split across 8 partitions). This occurs when partitioning is preserved, since each partition is sorted separately.

When partitioning is not preserved and all batches are sorted together, we see significant perf improvements. Additionally, when partitioning is preserved but the input data is all skewed to a single partition, we see the same perf improvement (as expected).

Here are the bench results for each of those scenarios:

```
group                                                                        main-sort                      rows-sort
-----                                                                        ---------                      ---------
sort mixed tuple                                                             1.00   29.5±2.83ms  ? ?/sec    1.04   30.5±3.23ms  ? ?/sec
sort mixed tuple preserve partitioning                                       1.00    4.7±0.94ms  ? ?/sec    1.52    7.1±0.64ms  ? ?/sec
sort mixed tuple preserve partitioning data skewed to first                  1.00   30.6±4.78ms  ? ?/sec    1.00   30.6±6.66ms  ? ?/sec
sort mixed utf8 dictionary tuple                                             2.60  60.8±13.04ms  ? ?/sec    1.00   23.4±0.93ms  ? ?/sec
sort mixed utf8 dictionary tuple preserve partitioning                       1.00    4.5±1.27ms  ? ?/sec    1.11    5.1±0.40ms  ? ?/sec
sort mixed utf8 dictionary tuple preserve partitioning data skewed to first  2.24   54.0±4.22ms  ? ?/sec    1.00   24.1±2.17ms  ? ?/sec
sort utf8 dictionary tuple                                                   2.32   54.7±7.35ms  ? ?/sec    1.00   23.6±3.48ms  ? ?/sec
sort utf8 dictionary tuple preserve partitioning                             1.00    3.7±0.37ms  ? ?/sec    1.24    4.6±0.38ms  ? ?/sec
sort utf8 dictionary tuple preserve partitioning data skewed to first        2.50   54.1±5.52ms  ? ?/sec    1.00   21.6±0.65ms  ? ?/sec
sort utf8 tuple                                                              1.79  62.5±13.08ms  ? ?/sec    1.00   35.0±1.62ms  ? ?/sec
sort utf8 tuple preserve partitioning                                        1.00    7.3±0.79ms  ? ?/sec    1.17    8.6±0.74ms  ? ?/sec
sort utf8 tuple preserve partitioning data skewed to first                   1.54   54.5±5.11ms  ? ?/sec    1.00   35.4±2.18ms  ? ?/sec
```
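
To make the three scenarios easier to compare, here is a rough sketch in plain Rust of how many rows each sort call would see in each case. This is not the benchmark code: `Scenario` and `sort_input_sizes` are hypothetical names invented for illustration, and the even split is an assumption based on the 100000 rows / 8 partitions figures above.

```rust
/// Hypothetical labels for the three bench scenarios described above.
#[derive(Debug, Clone, Copy)]
enum Scenario {
    /// Partitioning not preserved: all batches are sorted together.
    Merged,
    /// Partitioning preserved, rows spread evenly across partitions.
    PreservePartitioning,
    /// Partitioning preserved, all rows skewed to the first partition.
    PreserveSkewedToFirst,
}

/// Returns the number of rows seen by each independent sort call.
/// Illustrative only; assumes `partitions > 0`.
fn sort_input_sizes(total_rows: usize, partitions: usize, scenario: Scenario) -> Vec<usize> {
    assert!(partitions > 0, "need at least one partition");
    match scenario {
        // One sort over everything.
        Scenario::Merged => vec![total_rows],
        // One sort per partition, rows split as evenly as possible.
        Scenario::PreservePartitioning => {
            let base = total_rows / partitions;
            let rem = total_rows % partitions;
            (0..partitions)
                .map(|i| base + usize::from(i < rem))
                .collect()
        }
        // One sort per partition, but only the first has any rows.
        Scenario::PreserveSkewedToFirst => {
            let mut sizes = vec![0; partitions];
            sizes[0] = total_rows;
            sizes
        }
    }
}

fn main() {
    for scenario in [
        Scenario::Merged,
        Scenario::PreservePartitioning,
        Scenario::PreserveSkewedToFirst,
    ] {
        println!("{:?}: {:?}", scenario, sort_input_sizes(100_000, 8, scenario));
    }
}
```

Only the evenly partitioned case produces eight small sorts of 12500 rows each, which is where the regression shows up; the other two cases both end up with one sort over all 100000 rows.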
