aocsa commented on pull request #11210:
URL: https://github.com/apache/arrow/pull/11210#issuecomment-940106966
Thanks @westonpace, I made some changes to enable Thread Per Batch Mode
(async_mode=false) and Thread Per Operation Mode (async_mode=true).
As you mentioned before, this benchmark shows that spawning new thread tasks
doesn't reduce our core usage much (the percentage difference varies between
1% and 5%), though it doesn't increase core usage either.
```
--------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                 Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------
MinimalEndToEndBench/num_batches:100/batch_size:10/async_mode:1/real_time           3076857 ns     1061716 ns         225 bytes_per_second=1.54975M/s items_per_second=32.5007k/s
MinimalEndToEndBench/num_batches:100/batch_size:10/async_mode:0/real_time           2987236 ns     1020221 ns         234 bytes_per_second=1.59625M/s items_per_second=33.4758k/s
MinimalEndToEndBench/num_batches:100/batch_size:100/async_mode:1/real_time          3124765 ns     1075070 ns         224 bytes_per_second=15.2599M/s items_per_second=32.0024k/s
MinimalEndToEndBench/num_batches:100/batch_size:100/async_mode:0/real_time          3028913 ns     1031486 ns         232 bytes_per_second=15.7428M/s items_per_second=33.0151k/s
MinimalEndToEndBench/num_batches:100/batch_size:1000/async_mode:1/real_time         3013393 ns     1067201 ns         231 bytes_per_second=158.239M/s items_per_second=33.1852k/s
MinimalEndToEndBench/num_batches:100/batch_size:1000/async_mode:0/real_time         2982961 ns     1052526 ns         232 bytes_per_second=159.854M/s items_per_second=33.5237k/s
MinimalEndToEndBench/num_batches:100/batch_size:5000/async_mode:1/real_time         2996094 ns     1034481 ns         231 bytes_per_second=795.765M/s items_per_second=33.3768k/s
MinimalEndToEndBench/num_batches:100/batch_size:5000/async_mode:0/real_time         2918542 ns     1022779 ns         238 bytes_per_second=816.91M/s items_per_second=34.2637k/s
MinimalEndToEndBench/num_batches:1000/batch_size:10/async_mode:1/real_time         24135715 ns     7412783 ns          28 bytes_per_second=1.97565M/s items_per_second=41.4324k/s
MinimalEndToEndBench/num_batches:1000/batch_size:10/async_mode:0/real_time         24077781 ns     7391717 ns          30 bytes_per_second=1.9804M/s items_per_second=41.5321k/s
MinimalEndToEndBench/num_batches:1000/batch_size:100/async_mode:1/real_time        24011509 ns     7461978 ns          29 bytes_per_second=19.8587M/s items_per_second=41.6467k/s
MinimalEndToEndBench/num_batches:1000/batch_size:100/async_mode:0/real_time        24196028 ns     7541171 ns          30 bytes_per_second=19.7072M/s items_per_second=41.3291k/s
MinimalEndToEndBench/num_batches:1000/batch_size:1000/async_mode:1/real_time       23608108 ns     7296256 ns          28 bytes_per_second=201.98M/s items_per_second=42.3583k/s
MinimalEndToEndBench/num_batches:1000/batch_size:1000/async_mode:0/real_time       23518460 ns     7353338 ns          29 bytes_per_second=202.75M/s items_per_second=42.5198k/s
MinimalEndToEndBench/num_batches:1000/batch_size:5000/async_mode:1/real_time       26493311 ns     7815151 ns          27 bytes_per_second=899.92M/s items_per_second=37.7454k/s
MinimalEndToEndBench/num_batches:1000/batch_size:5000/async_mode:0/real_time       26073649 ns     7780264 ns          27 bytes_per_second=914.404M/s items_per_second=38.3529k/s
MinimalEndToEndBench/num_batches:10000/batch_size:10/async_mode:1/real_time       223436234 ns    70675962 ns           3 bytes_per_second=2.13411M/s items_per_second=44.7555k/s
MinimalEndToEndBench/num_batches:10000/batch_size:10/async_mode:0/real_time       220303752 ns    70971486 ns           3 bytes_per_second=2.16445M/s items_per_second=45.3919k/s
MinimalEndToEndBench/num_batches:10000/batch_size:100/async_mode:1/real_time      228117278 ns    70758735 ns           3 bytes_per_second=20.9032M/s items_per_second=43.8371k/s
MinimalEndToEndBench/num_batches:10000/batch_size:100/async_mode:0/real_time      216247699 ns    70714074 ns           3 bytes_per_second=22.0505M/s items_per_second=46.2433k/s
MinimalEndToEndBench/num_batches:10000/batch_size:1000/async_mode:1/real_time     228636211 ns    72468405 ns           3 bytes_per_second=208.557M/s items_per_second=43.7376k/s
MinimalEndToEndBench/num_batches:10000/batch_size:1000/async_mode:0/real_time     225630993 ns    71128965 ns           3 bytes_per_second=211.335M/s items_per_second=44.3202k/s
MinimalEndToEndBench/num_batches:10000/batch_size:5000/async_mode:1/real_time     248843462 ns    73516841 ns           3 bytes_per_second=958.107M/s items_per_second=40.1859k/s
MinimalEndToEndBench/num_batches:10000/batch_size:5000/async_mode:0/real_time     239517144 ns    73220804 ns           3 bytes_per_second=995.413M/s items_per_second=41.7507k/s
MinimalEndToEndBench/num_batches:100000/batch_size:10/async_mode:1/real_time     2122319750 ns   601241313 ns           1 bytes_per_second=2.24677M/s items_per_second=47.1183k/s
MinimalEndToEndBench/num_batches:100000/batch_size:10/async_mode:0/real_time     2101197844 ns   587004432 ns           1 bytes_per_second=2.26936M/s items_per_second=47.5919k/s
MinimalEndToEndBench/num_batches:100000/batch_size:100/async_mode:1/real_time    2128421476 ns   629112804 ns           1 bytes_per_second=22.4033M/s items_per_second=46.9832k/s
MinimalEndToEndBench/num_batches:100000/batch_size:100/async_mode:0/real_time    2100455252 ns   629750934 ns           1 bytes_per_second=22.7016M/s items_per_second=47.6087k/s
MinimalEndToEndBench/num_batches:100000/batch_size:1000/async_mode:1/real_time   2175799302 ns   629577506 ns           1 bytes_per_second=219.155M/s items_per_second=45.9601k/s
MinimalEndToEndBench/num_batches:100000/batch_size:1000/async_mode:0/real_time   2128841169 ns   627977128 ns           1 bytes_per_second=223.989M/s items_per_second=46.9739k/s
MinimalEndToEndBench/num_batches:100000/batch_size:5000/async_mode:1/real_time   2382353053 ns   651938913 ns           1 bytes_per_second=1000.77M/s items_per_second=41.9753k/s
MinimalEndToEndBench/num_batches:100000/batch_size:5000/async_mode:0/real_time   2294256145 ns   638338052 ns           1 bytes_per_second=1039.2M/s items_per_second=43.5871k/s
```
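
For reference, here is a rough sketch of how such a parameterized benchmark can be registered with Google Benchmark. The parameter names mirror the output above, but the plan-building code is elided and the actual wiring in this PR may differ:

```cpp
#include <benchmark/benchmark.h>

// Hypothetical registration sketch: the real benchmark builds and runs a
// Scan->Filter->Project ExecPlan inside the timing loop; only the
// parameter plumbing is shown here.
static void MinimalEndToEndBench(benchmark::State& state) {
  const int64_t num_batches = state.range(0);
  const int64_t batch_size = state.range(1);
  const bool async_mode = state.range(2) != 0;  // true: thread per operation
  for (auto _ : state) {
    // ... build the plan, push num_batches batches of batch_size rows,
    // and wait for completion, honoring async_mode ...
    benchmark::DoNotOptimize(batch_size);
    benchmark::DoNotOptimize(async_mode);
  }
  state.SetItemsProcessed(state.iterations() * num_batches);
}

BENCHMARK(MinimalEndToEndBench)
    ->ArgNames({"num_batches", "batch_size", "async_mode"})
    ->ArgsProduct({{100, 1000, 10000, 100000}, {10, 100, 1000, 5000}, {0, 1}})
    ->UseRealTime();

BENCHMARK_MAIN();
```

With `ArgsProduct`, Google Benchmark runs the Cartesian product of the parameter lists, which matches the 32 combinations reported above.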
> This benchmark is very cool and we certainly need more benchmarks to
understand the overhead of ExecPlan. I'm not sure it explains the conbench
results however. The benchmark creates a Scan->Filter->Project plan and then
tests between serial & threaded.
>
> However, the conbench result isn't a difference between has_executor=true
and has_executor=false.
>
> The thing I most want to understand is the difference in performance
between "thread per batch" and "thread per operation". In other words, with
your change we now create 3 tasks for every batch in threaded mode. Before your
change we created 1 task for every batch.
>
> Can you add a flag to your benchmark that only controls whether the filter
> & project steps create a new task or use the serial runner? There would be two
> modes:
>
> Thread Per Batch Mode:
>
> * ExecPlan::executor is set
> * SourceNode creates tasks
> * FilterNode and ProjectNode use the serial runner
>
> Thread Per Operation Mode:
>
> * ExecPlan::executor is set (it is set in both)
> * SourceNode creates tasks
> * FilterNode and ProjectNode use the parallel runner and create tasks
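
To make the two modes concrete, here is a self-contained sketch. This is not the actual Arrow code: `std::async` stands in for the plan's executor, and `Batch`, `Filter`, and `Project` are placeholders for the real nodes:

```cpp
#include <future>
#include <utility>
#include <vector>

struct Batch { std::vector<int> values; };

Batch Filter(Batch b) {  // placeholder for the real filter expression
  std::vector<int> kept;
  for (int v : b.values)
    if (v % 2 == 0) kept.push_back(v);
  return Batch{std::move(kept)};
}

Batch Project(Batch b) {  // placeholder for the real projection
  for (int& v : b.values) v *= 2;
  return b;
}

// Thread Per Batch: one task carries the batch through both steps inline.
// Thread Per Operation: filter and project each become their own task, so
// each batch costs three task submissions (source, filter, project).
std::future<Batch> Process(Batch b, bool thread_per_operation) {
  if (thread_per_operation) {
    return std::async(std::launch::async, [](Batch b) {
      Batch filtered =
          std::async(std::launch::async, Filter, std::move(b)).get();
      return std::async(std::launch::async, Project, std::move(filtered)).get();
    }, std::move(b));
  }
  return std::async(std::launch::async,
                    [](Batch b) { return Project(Filter(std::move(b))); },
                    std::move(b));
}

int main() {
  Batch batch{{1, 2, 3, 4, 5, 6}};
  Batch out = Process(std::move(batch), /*thread_per_operation=*/true).get();
  return out.values == std::vector<int>{4, 8, 12} ? 0 : 1;
}
```

In Thread Per Operation mode every batch costs extra task submissions, which is the overhead the async_mode flag in the benchmark above toggles.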