tustvold commented on PR #2226: URL: https://github.com/apache/arrow-datafusion/pull/2226#issuecomment-1099300739
Ok, as promised, some benchmarks. Note that these come with some pretty big disclaimers:

* Until we make changes to `ExecutionPlan`, the scheduler cannot introduce additional parallelism within a partition, as it is constrained by the current pull-based interface. Removing this constraint will be a key performance unlock
* The Parquet SQL benchmarks are massively dominated by the parquet decoding performance, which may not be representative of all query workloads
* Currently DataFusion uses `tokio::task::spawn_blocking` in the `tokio` case. Aside from giving tokio more threads to play with, this also results in perfect thread-locality for the parquet decoder. I have therefore collected results with and without this enabled
* My focus thus far has been to get something working, not to squeeze out as much performance as possible, so there is likely a lot that could be improved

That all being said, in like-for-like comparisons (i.e. without spawn blocking) we are actually seeing a slight performance improvement from the scheduler. I've not looked into why this is, but the only thing I can think of that might have improved performance is the switch to rayon; everything else is either the same or would make it slower.

* [Without Spawn Blocking.txt](https://github.com/apache/arrow-datafusion/files/8490374/Without.Spawn.Blocking.txt)
* [With Spawn Blocking.txt](https://github.com/apache/arrow-datafusion/files/8490375/With.Spawn.Blocking.txt)
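
To make the first disclaimer concrete, here is a minimal, std-only sketch of why a pull-based interface caps parallelism at the number of partitions. It is not DataFusion code: `Batch`, `PartitionStream`, `process`, `run_pull_based` and `run_per_batch_tasks` are hypothetical names invented for illustration, and plain scoped threads stand in for a real thread pool such as rayon. The second function assumes batches could be handed out as independent tasks, which is only one possible shape for a scheduler that is not bound to the pull model.

```rust
use std::thread;

/// Hypothetical stand-in for a record batch.
type Batch = Vec<u64>;

/// Hypothetical pull-based partition stream: the consumer asks for the
/// next batch and only then is it produced.
struct PartitionStream {
    batches: std::vec::IntoIter<Batch>,
}

impl Iterator for PartitionStream {
    type Item = Batch;
    fn next(&mut self) -> Option<Batch> {
        self.batches.next()
    }
}

/// Stand-in for per-batch work such as decoding or filtering.
fn process(batch: &Batch) -> u64 {
    batch.iter().sum()
}

/// Pull-based execution: one worker per partition, each draining its
/// stream sequentially. Returns the total and the number of workers used.
fn run_pull_based(partitions: Vec<Vec<Batch>>) -> (u64, usize) {
    let workers = partitions.len();
    let total = thread::scope(|s| {
        let handles: Vec<_> = partitions
            .into_iter()
            .map(|p| {
                s.spawn(move || {
                    let stream = PartitionStream { batches: p.into_iter() };
                    // Batches within a partition are handled one after another.
                    stream.map(|b| process(&b)).sum::<u64>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    });
    (total, workers)
}

/// Assumed alternative: every batch is an independent task, so the number
/// of schedulable units is the batch count rather than the partition count.
/// One thread per batch is used here only to keep the sketch std-only.
fn run_per_batch_tasks(partitions: Vec<Vec<Batch>>) -> (u64, usize) {
    let batches: Vec<Batch> = partitions.into_iter().flatten().collect();
    let tasks = batches.len();
    let total = thread::scope(|s| {
        let handles: Vec<_> = batches
            .iter()
            .map(|b| s.spawn(move || process(b)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    });
    (total, tasks)
}
```

With two partitions holding three batches in total, `run_pull_based` can keep at most two workers busy, while `run_per_batch_tasks` exposes three units of work. Both compute the same result; only the available parallelism differs.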
