[GitHub] [arrow] aocsa commented on pull request #11210: ARROW-13576: [C++] Replace ExecNode::InputReceived with ::MakeTask

GitBox Thu, 07 Oct 2021 21:43:51 -0700


aocsa commented on pull request #11210:
URL: https://github.com/apache/arrow/pull/11210#issuecomment-938339174



   Thanks @weston, I rebased this PR and addressed the latest feedback. I also 
ran some benchmarks to measure the impact of: 1. the possible issue with 
ExecBatch copies; 2. async mode execution.  
   
   Note: I ran the benchmark with [this 
code](https://github.com/apache/arrow/blob/3136a9babc8ba5fba55d35b88c7e5a967d4c01e8/cpp/src/arrow/dataset/scanner_benchmark.cc) 
on the following machine configuration:
   
   ```
   Run on (16 X 3600 MHz CPU s) with 32Gb RAM
   CPU Caches:
     L1 Data 32 KiB (x8)
     L1 Instruction 32 KiB (x8)
     L2 Unified 512 KiB (x8)
     L3 Unified 16384 KiB (x2)
   ```
   
   1.1 Sync mode with the lambda task function capturing [batches by 
copy](https://github.com/apache/arrow/blob/85af59892b83fb49af58c2919d98853b9c1779fd/cpp/src/arrow/compute/exec/exec_plan.h#L311)
   ```
   Benchmark                                               Time             CPU   Iterations UserCounters...
   ---------------------------------------------------------------------------------------------------------
   MinimalEndToEndBench/100/10/min_time:1.000           3.46 ms         1.71 ms          812 items_per_second=586.126/s
   MinimalEndToEndBench/1000/100/min_time:1.000         52.0 ms         38.0 ms           36 items_per_second=26.3494/s
   MinimalEndToEndBench/10000/100/min_time:1.000        1102 ms          997 ms            2 items_per_second=1.00278/s
   MinimalEndToEndBench/10000/1000/min_time:1.000       4752 ms         4644 ms            1 items_per_second=0.215319/s
   ```
   1.2 Sync mode with the lambda task function capturing batches by move [with 
std::bind](https://gist.github.com/aocsa/9da1f32ae1c36c133316e32d84711bc3)
   ```
   
   ---------------------------------------------------------------------------------------------------------
   Benchmark                                               Time             CPU   Iterations UserCounters...
   ---------------------------------------------------------------------------------------------------------
   MinimalEndToEndBench/100/10/min_time:1.000           3.02 ms         1.62 ms          885 items_per_second=617.506/s
   MinimalEndToEndBench/1000/100/min_time:1.000         51.7 ms         37.9 ms           35 items_per_second=26.4141/s
   MinimalEndToEndBench/10000/100/min_time:1.000        1132 ms         1041 ms            1 items_per_second=0.961052/s
   MinimalEndToEndBench/10000/1000/min_time:1.000       4795 ms         4680 ms            1 items_per_second=0.213687/s
   ```
    
   As you can see, ExecBatch copies were not the cause of the worse performance 
in some queries: the copy and move timings are nearly identical. There aren't 
that many ExecBatch instances, and each copy is cheap (in most cases it only 
copies references). 
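
To illustrate why a by-copy capture can be cheap, here is a minimal sketch. It is not Arrow code: `Batch`, `MakeBatch` and the two helper functions are hypothetical stand-ins, and the sketch assumes that a batch mostly holds reference-counted pointers to its column data. It also uses a C++14 init-capture for the move case rather than the `std::bind` approach from the gist linked above.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for ExecBatch (not the Arrow type): each column is a
// shared, reference-counted buffer, so copying a Batch copies pointers only.
struct Batch {
  std::vector<std::shared_ptr<std::vector<int64_t>>> columns;
};

Batch MakeBatch(int num_columns, int num_rows) {
  assert(num_columns > 0 && num_rows >= 0);
  Batch b;
  for (int i = 0; i < num_columns; ++i) {
    b.columns.push_back(std::make_shared<std::vector<int64_t>>(num_rows, i));
  }
  return b;
}

// Capture by copy: the closure holds a second reference to every column,
// but the column data itself is not duplicated.
long UseCountWithCopyCapture() {
  Batch batch = MakeBatch(2, 1000);
  auto task = [batch]() { return batch.columns[0].use_count(); };
  return task();  // original batch + copy inside the closure
}

// Capture by move (C++14 init-capture): ownership of the pointers is
// transferred into the closure, so no reference count is incremented.
long UseCountWithMoveCapture() {
  Batch batch = MakeBatch(2, 1000);
  auto task = [batch = std::move(batch)]() {
    return batch.columns[0].use_count();
  };
  return task();  // only the closure owns the column
}
```

Under these assumptions the only extra work in the copy case is one small vector allocation and one atomic increment per column, which would be consistent with the near-identical timings in 1.1 and 1.2.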
   
   2.  Execution without/with ThreadPool
   ```
   without threadpool
   ------------------------------------------------------------------------------------------------------------
   Benchmark                                                  Time             CPU   Iterations UserCounters...
   ------------------------------------------------------------------------------------------------------------
   MinimalEndToEndBench/100/10/0/min_time:1.000            2.33 ms         2.33 ms          601 items_per_second=428.441/s
   MinimalEndToEndBench/1000/100/0/min_time:1.000          46.8 ms         46.8 ms           30 items_per_second=21.3571/s
   MinimalEndToEndBench/10000/100/0/min_time:1.000         1172 ms         1172 ms            1 items_per_second=0.853482/s
   MinimalEndToEndBench/10000/1000/0/min_time:1.000        4906 ms         4905 ms            1 items_per_second=0.203876/s
   MinimalEndToEndBench/10000/10000/0/min_time:1.000      52141 ms        52129 ms            1 items_per_second=0.0191832/s
   
   with threadpool
   MinimalEndToEndBench/100/10/1/min_time:1.000            3.87 ms         1.87 ms          745 items_per_second=533.584/s
   MinimalEndToEndBench/1000/100/1/min_time:1.000          54.1 ms         38.3 ms           37 items_per_second=26.09/s
   MinimalEndToEndBench/10000/100/1/min_time:1.000         1153 ms         1010 ms            1 items_per_second=0.990056/s
   MinimalEndToEndBench/10000/1000/1/min_time:1.000        4771 ms         4624 ms            1 items_per_second=0.216249/s
   MinimalEndToEndBench/10000/10000/1/min_time:1.000      49761 ms        49578 ms            1 items_per_second=0.0201702/s
   ```
   As you can see, wall-clock time with the thread pool (async mode) is worse 
for small workloads (3.87 ms vs 2.33 ms for 100/10) but tends to be better for 
large ones (49761 ms vs 52141 ms for 10000/10000).
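
For a quick way to compare the two runs, the sketch below computes the ratio of wall-clock time with the thread pool to wall-clock time without it, using the Time column from the tables in section 2. The `Row`, `WallTimes` and `Ratio` names are hypothetical helpers, not part of the benchmark code.

```cpp
#include <cassert>
#include <vector>

// One benchmark configuration with the wall-clock times reported above.
struct Row {
  const char* args;        // benchmark arguments, as printed in the table
  double without_pool_ms;  // "without threadpool" Time column
  double with_pool_ms;     // "with threadpool" Time column
};

// Wall-clock times copied from the two tables in section 2.
inline std::vector<Row> WallTimes() {
  return {
      {"100/10", 2.33, 3.87},
      {"1000/100", 46.8, 54.1},
      {"10000/100", 1172.0, 1153.0},
      {"10000/1000", 4906.0, 4771.0},
      {"10000/10000", 52141.0, 49761.0},
  };
}

// Ratio > 1 means the thread pool run took longer; < 1 means it was faster.
inline double Ratio(const Row& r) {
  assert(r.without_pool_ms > 0.0);
  return r.with_pool_ms / r.without_pool_ms;
}
```

With these numbers the ratio is above 1 for the two smallest configurations and below 1 for the three largest, so the crossover sits somewhere between 1000/100 and 10000/100. These are single runs with few iterations, so small differences should be read with care.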
    
   
   Looking forward to your thoughts. cc @felipeblazing
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
