Dandandan commented on issue #17494: URL: https://github.com/apache/datafusion/issues/17494#issuecomment-3566780230
> When running TPC-DS q72, I've noticed that, regardless of the underlying file format, latency increases dramatically even at relatively modest scale factors like 10. I measured the query at around 2.4 seconds with SF=1, but over 60 seconds with SF=10.
>
> In my benchmarking setup, the plan (as you can see, it's extremely join-heavy) is [here](https://gist.github.com/AdamGS/cea5816b321ca70323975c05d6048f36).
>
> Profiling the query using samply (this is with `branch-50` over Parquet, SF=1):
>
> [Image: flamegraph from the samply profile]
>
> Playing around with it, it seems like most of the time is spent in the loop inside the `chain_traverse` macro. I've tried a few common performance techniques (making it an explicitly inlined generic function, changing how the indices and values memory is managed/written to), but nothing made a noticeable difference.

One thing that seems off to me in this flamegraph is the amount of time spent in `Vec::push`. The capacity of the Vec is set to the rows limit, so push should be pretty inexpensive? Or is the profile showing misleading info?
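To make the `Vec::push` question concrete, here is a minimal standalone sketch (not DataFusion code; `push_within_capacity` is a hypothetical helper written for illustration). It assumes the output vector is created with `Vec::with_capacity` for the full number of rows, and checks that pushing up to that many values never moves or grows the backing buffer. Each `push` still performs a capacity check and a length update, so it is cheap but not free inside a hot loop.

```rust
/// Hypothetical helper: push `n` values into a Vec pre-sized for `n`
/// and report whether the backing buffer was ever reallocated.
fn push_within_capacity(n: usize) -> (Vec<u64>, bool) {
    let mut out: Vec<u64> = Vec::with_capacity(n);
    let start_ptr = out.as_ptr();
    let start_cap = out.capacity();

    for i in 0..n as u64 {
        // Each push compares len against capacity and bumps len.
        // While len < capacity, the grow/reallocate path is not taken.
        out.push(i);
    }

    // If the buffer had been reallocated, the capacity would have changed
    // (and the pointer most likely as well).
    let reallocated = out.as_ptr() != start_ptr || out.capacity() != start_cap;
    (out, reallocated)
}

fn main() {
    let (out, reallocated) = push_within_capacity(8192);
    println!("len = {}, reallocated = {}", out.len(), reallocated);
}
```

If a check like this holds in the real code path, the time attributed to `Vec::push` is unlikely to be reallocation. It may instead be the per-element bookkeeping, or time from the surrounding loop that the profiler attributes to the inlined `push` frame.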
