zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3074225566
> DataFusion is underperforming the Polars streaming engine on some localhost join queries (1e8 rows of data on a Macbook M3 with 16GB of RAM):
>
> [Image]
>
> Here are the [join queries](https://github.com/apache/datafusion/blob/main/benchmarks/queries/h2o/join.sql).
>
> I am guessing the join operator can be optimized, similar to how the filtering and aggregation operations were optimized.
>
> Here is an example of how the median function was made faster: [#13550](https://github.com/apache/datafusion/issues/13550)
>
> See this epic for more info: [#13548](https://github.com/apache/datafusion/issues/13548)

Is this comparison based on the Parquet or the CSV format? Our h2o benchmark tool currently uses the CSV format.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org