wesm opened a new pull request, #13630:
URL: https://github.com/apache/arrow/pull/13630

   This completes the port to use ExecSpan everywhere. I also changed the ExecBatchIterator benchmarks to use ExecSpan to show the performance improvement in input splitting that we've discussed in the past:
   
   Splitting inputs into small ExecSpan:
   
   ```
   ------------------------------------------------------------------------------------
   Benchmark                          Time             CPU   Iterations UserCounters...
   ------------------------------------------------------------------------------------
   BM_ExecSpanIterator/1024      205671 ns       205667 ns         3395 items_per_second=4.86223k/s
   BM_ExecSpanIterator/4096       54749 ns        54750 ns        13121 items_per_second=18.265k/s
   BM_ExecSpanIterator/16384      15979 ns        15979 ns        42628 items_per_second=62.5824k/s
   BM_ExecSpanIterator/65536       5597 ns         5597 ns       125099 items_per_second=178.668k/s
   ```
   
   Splitting inputs into small ExecBatch:
   
   ```
   -------------------------------------------------------------------------------------
   Benchmark                           Time             CPU   Iterations UserCounters...
   -------------------------------------------------------------------------------------
   BM_ExecBatchIterator/1024    17163432 ns     17163171 ns           41 items_per_second=58.2643/s
   BM_ExecBatchIterator/4096     4243467 ns      4243316 ns          163 items_per_second=235.665/s
   BM_ExecBatchIterator/16384    1093680 ns      1093638 ns          620 items_per_second=914.38/s
   BM_ExecBatchIterator/65536     272451 ns       272435 ns         2584 items_per_second=3.6706k/s
   ```
   
   Because the input in this benchmark has 1M elements, splitting it into 1024 chunks of size 1024 adds only about 0.2 ms of overhead with ExecSpanIterator versus 17.16 ms with ExecBatchIterator (a more than 80x improvement).
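   To make the mechanism concrete, here is a minimal, self-contained sketch of why span-based splitting is so much cheaper. This is not Arrow's actual API; `ColumnSpan`, `SpanIterator`, and `BatchIterator` are illustrative assumptions. The span-style iterator reuses one non-owning view and only bumps an offset per chunk, while the batch-style iterator materializes a fresh owning container for every chunk, paying an allocation and a copy each step:

   ```cpp
   #include <algorithm>
   #include <cassert>
   #include <cstdint>
   #include <vector>

   // Non-owning view over a contiguous column: pointer + offset + length.
   struct ColumnSpan {
     const int64_t* data = nullptr;
     int64_t offset = 0;
     int64_t length = 0;
   };

   // Span-style: advance a reusable view in place; no per-chunk allocation.
   struct SpanIterator {
     const std::vector<int64_t>* column;
     int64_t chunk_size;
     int64_t position = 0;

     bool Next(ColumnSpan* span) {
       const int64_t size = static_cast<int64_t>(column->size());
       if (position >= size) return false;
       span->data = column->data();
       span->offset = position;
       span->length = std::min(chunk_size, size - position);
       position += span->length;
       return true;
     }
   };

   // Batch-style: copy each chunk into a fresh owning vector.
   struct BatchIterator {
     const std::vector<int64_t>* column;
     int64_t chunk_size;
     int64_t position = 0;

     bool Next(std::vector<int64_t>* batch) {
       const int64_t size = static_cast<int64_t>(column->size());
       if (position >= size) return false;
       const int64_t n = std::min(chunk_size, size - position);
       batch->assign(column->begin() + position, column->begin() + position + n);
       position += n;
       return true;
     }
   };

   int main() {
     std::vector<int64_t> input(1 << 20);  // 1M elements, as in the benchmark

     SpanIterator spans{&input, 1024};
     ColumnSpan view;
     int span_chunks = 0;
     while (spans.Next(&view)) ++span_chunks;

     BatchIterator batches{&input, 1024};
     std::vector<int64_t> batch;
     int batch_chunks = 0;
     while (batches.Next(&batch)) ++batch_chunks;

     // Both produce 1024 chunks of 1024 elements; only the batch path copies.
     assert(span_chunks == 1024);
     assert(batch_chunks == 1024);
     return 0;
   }
   ```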
   
   This won't by itself do much to improve performance in Acero, but the work I've been doing here is a precondition for things the community can explore in the future:
   
   * A leaner ExecuteScalarExpression implementation that reuses temporary 
allocations (ARROW-16758)
   * Parallel expression evaluation
   * Better defining morsel (~1M elements) versus task (~1K elements) 
granularity in execution 
   * Work stealing so that we don't "hog" the thread pools, and we keep the 
work pinned to a particular CPU core if there are other things going on at the 
same time
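
   As a rough illustration of the morsel/task distinction in the list above, here is a hedged sketch of two-level splitting. The constants and the `SplitIntoTasks` helper are assumptions for illustration, not Acero's actual scheduler: morsels (~64K rows here) are the units handed to worker threads, and each morsel is subdivided into cache-friendly tasks (~1K rows):

   ```cpp
   #include <algorithm>
   #include <cassert>
   #include <cstdint>
   #include <utility>
   #include <vector>

   // Split [0, total) into morsels (scheduling units), then split each
   // morsel into tasks (execution units). Returns (offset, length) pairs.
   std::vector<std::pair<int64_t, int64_t>> SplitIntoTasks(
       int64_t total, int64_t morsel_size, int64_t task_size) {
     std::vector<std::pair<int64_t, int64_t>> tasks;
     for (int64_t m = 0; m < total; m += morsel_size) {
       const int64_t morsel_end = std::min(m + morsel_size, total);
       for (int64_t t = m; t < morsel_end; t += task_size) {
         tasks.emplace_back(t, std::min(t + task_size, morsel_end) - t);
       }
     }
     return tasks;
   }

   int main() {
     // 1M rows, ~64K-row morsels, ~1K-row tasks.
     auto tasks = SplitIntoTasks(1 << 20, 1 << 16, 1 << 10);
     assert(tasks.size() == 1024);          // 1M rows / 1K rows per task
     assert(tasks.front().second == 1024);  // each task covers 1K rows
     return 0;
   }
   ```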

