tustvold commented on PR #2226:
URL: 
https://github.com/apache/arrow-datafusion/pull/2226#issuecomment-1099300739

   OK, as promised, some benchmarks. Note that these come with some 
pretty big disclaimers:
   
   * Until we make changes to `ExecutionPlan`, the scheduler cannot introduce 
additional parallelism within a partition, as it is constrained by the current 
pull-based interface. Removing this constraint will be a key performance unlock
   * The Parquet SQL benchmarks are massively dominated by the parquet 
performance, which may not be representative of all query workloads
   * Currently DataFusion uses `tokio::task::spawn_blocking` in the `tokio` 
case. Aside from giving tokio more threads to play with, this also results in 
perfect thread-locality for the parquet decoder. I have therefore collected 
results with and without this enabled
   * My focus thus far has been to get something working rather than to 
squeeze out as much performance as possible, so there is likely a lot that 
could be improved
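
   To illustrate the first disclaimer, here is a minimal sketch of why a 
pull-based interface limits parallelism within a partition. It uses only the 
Rust standard library; `Batch`, `process`, `pull_partition` and 
`push_partition` are hypothetical names invented for this example and are not 
part of DataFusion or of this PR.

```rust
use std::thread;

/// Hypothetical stand-in for a batch of rows (not DataFusion's `RecordBatch`).
type Batch = Vec<u64>;

/// Hypothetical per-batch work, standing in for an operator such as a projection.
fn process(batch: &Batch) -> u64 {
    batch.iter().map(|v| v * 2).sum()
}

/// Pull-based: the consumer asks the partition for one batch at a time, so the
/// batches of a single partition are processed one after another on the
/// calling thread.
fn pull_partition(partition: impl Iterator<Item = Batch>) -> u64 {
    let mut total = 0;
    for batch in partition {
        total += process(&batch);
    }
    total
}

/// Push-based: whoever holds the batches decides where each one runs, so the
/// batches of a single partition can be fanned out across several workers.
fn push_partition(batches: &[Batch], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Round up so every batch lands in some chunk; `chunks(0)` would panic.
    let chunk_len = ((batches.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = batches
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(process).sum::<u64>()))
            .collect();
        // Combine the partial results once every worker has finished.
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    })
}
```

   Both functions compute the same result for the same batches; the 
difference is only in who decides when and where each batch is processed.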
   
   That said, in like-for-like comparisons (i.e. without spawn blocking) we 
are actually seeing a slight performance improvement from the scheduler. I've 
not looked into why this is, but the only thing I can think of that might have 
improved performance is the switch to rayon; everything else is either the 
same or would make it slower.
   
   * [Without Spawn 
Blocking.txt](https://github.com/apache/arrow-datafusion/files/8490374/Without.Spawn.Blocking.txt)
   * [With Spawn 
Blocking.txt](https://github.com/apache/arrow-datafusion/files/8490375/With.Spawn.Blocking.txt)
   