aocsa commented on pull request #11210: URL: https://github.com/apache/arrow/pull/11210#issuecomment-938339174
Thanks @weston, I rebased this PR and addressed the latest feedback. I also ran some benchmarks to see the impact of:

1. the possible issue with ExecBatch copies;
2. async mode execution.

Note: I ran the benchmarks with the [following code](https://github.com/apache/arrow/blob/3136a9babc8ba5fba55d35b88c7e5a967d4c01e8/cpp/src/arrow/dataset/scanner_benchmark.cc) on this machine configuration:

```
Run on (16 X 3600 MHz CPU s) with 32Gb RAM
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 16384 KiB (x2)
```

1.1 Sync mode with the lambda task function capturing [batches by copy](https://github.com/apache/arrow/blob/85af59892b83fb49af58c2919d98853b9c1779fd/cpp/src/arrow/compute/exec/exec_plan.h#L311)

```
Benchmark                                            Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
MinimalEndToEndBench/100/10/min_time:1.000        3.46 ms         1.71 ms          812 items_per_second=586.126/s
MinimalEndToEndBench/1000/100/min_time:1.000      52.0 ms         38.0 ms           36 items_per_second=26.3494/s
MinimalEndToEndBench/10000/100/min_time:1.000     1102 ms          997 ms            2 items_per_second=1.00278/s
MinimalEndToEndBench/10000/1000/min_time:1.000    4752 ms         4644 ms            1 items_per_second=0.215319/s
```

1.2 Sync mode with the lambda task function capturing batches by move [with std::bind](https://gist.github.com/aocsa/9da1f32ae1c36c133316e32d84711bc3)

```
---------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
MinimalEndToEndBench/100/10/min_time:1.000        3.02 ms         1.62 ms          885 items_per_second=617.506/s
MinimalEndToEndBench/1000/100/min_time:1.000      51.7 ms         37.9 ms           35 items_per_second=26.4141/s
MinimalEndToEndBench/10000/100/min_time:1.000     1132 ms         1041 ms            1 items_per_second=0.961052/s
MinimalEndToEndBench/10000/1000/min_time:1.000    4795 ms         4680 ms            1 items_per_second=0.213687/s
```

As you can see, ExecBatch copies were not the culprit for the worse performance in some queries, because there aren't "that" many ExecBatch instances and the copy is cheap (in most cases it just copies references).

2. Execution without/with ThreadPool

```
without threadpool
------------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
MinimalEndToEndBench/100/10/0/min_time:1.000         2.33 ms         2.33 ms          601 items_per_second=428.441/s
MinimalEndToEndBench/1000/100/0/min_time:1.000       46.8 ms         46.8 ms           30 items_per_second=21.3571/s
MinimalEndToEndBench/10000/100/0/min_time:1.000      1172 ms         1172 ms            1 items_per_second=0.853482/s
MinimalEndToEndBench/10000/1000/0/min_time:1.000     4906 ms         4905 ms            1 items_per_second=0.203876/s
MinimalEndToEndBench/10000/10000/0/min_time:1.000   52141 ms        52129 ms            1 items_per_second=0.0191832/s

with threadpool
MinimalEndToEndBench/100/10/1/min_time:1.000         3.87 ms         1.87 ms          745 items_per_second=533.584/s
MinimalEndToEndBench/1000/100/1/min_time:1.000       54.1 ms         38.3 ms           37 items_per_second=26.09/s
MinimalEndToEndBench/10000/100/1/min_time:1.000      1153 ms         1010 ms            1 items_per_second=0.990056/s
MinimalEndToEndBench/10000/1000/1/min_time:1.000     4771 ms         4624 ms            1 items_per_second=0.216249/s
MinimalEndToEndBench/10000/10000/1/min_time:1.000   49761 ms        49578 ms            1 items_per_second=0.0201702/s
```

As you can see, performance when
the workload is small is worse in async mode (wall time of 3.87 ms vs 2.33 ms for the 100/10 case), but when the workload is large it tends to be better (49761 ms vs 52141 ms for the 10000/10000 case).

Looking forward to your thoughts.

cc @felipeblazing
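
For readers following along, the copy-versus-move comparison in 1.1 and 1.2 can be illustrated with a small standalone sketch. This is not the code from the PR or the gist: `FakeBatch`, `SumFirstColumn`, `MakeTaskByCopy` and `MakeTaskByMove` are hypothetical names, and the assumption that a batch mostly holds reference-counted pointers to its data is taken from the "just references" remark above.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an ExecBatch (assumption, not Arrow's type):
// columns are held by shared_ptr, so copying the batch copies pointers
// and bumps reference counts; the underlying buffers are not duplicated.
struct FakeBatch {
  std::vector<std::shared_ptr<std::vector<int>>> columns;
};

// Stand-in for the work a scheduled task performs on a batch.
inline long SumFirstColumn(const FakeBatch& batch) {
  long total = 0;
  for (int v : *batch.columns[0]) total += v;
  return total;
}

// 1.1 style: the lambda captures the batch by copy.
inline std::function<long()> MakeTaskByCopy(const FakeBatch& batch) {
  return [batch]() { return SumFirstColumn(batch); };
}

// 1.2 style: the batch is moved into the callable with std::bind,
// a common C++11 way to emulate a move capture.
inline std::function<long()> MakeTaskByMove(FakeBatch batch) {
  return std::bind([](const FakeBatch& b) { return SumFirstColumn(b); },
                   std::move(batch));
}
```

In this sketch the copy variant costs one vector allocation plus one reference-count increment per column, while the move variant transfers ownership without touching the counts. Both leave the column data itself untouched, which would be consistent with the two variants benchmarking almost identically.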
