jayzhan211 commented on issue #13099: URL: https://github.com/apache/datafusion/issues/13099#issuecomment-2525483850
> > > @alamb I would really appreciate any advice you could give when you have a moment.
> >
> > I think we would have to get some detailed profiling to really know for sure, but I suspect that ClickBench has non trivial caches (buffer caching, page caches, etc)
> > DataFusion, as a serverless engine, does not have any such caching (the only difference between cold/hot run is that on the hot run, data from disk will be in the Linux page cache, so may not do any actual IO)
> >
> > It might also help to break down which queries showed the biggest discrepancy -- were they queries that already ran in 100ms (in which case caching, avoiding re-reading metadata might be a bigger part of processing)
>
> After conducting more experiments, I made some unexpected discoveries:
>
> In the public clickbench results, Clickhouse was using a version newer than 24.11, while our server had 24.1/24.3 installed. Therefore, I re-ran the benchmark using the latest version 24.12, and this time, the results were similar to those on the clickbench website - Datafusion was faster than Clickhouse in both cold run and hot run phases, and these results were consistently reproducible. This means that recent updates to Clickhouse have led to a decline in its query performance for parquet files. In the earlier versions, Clickhouse still had better performance during the hot run phase.
>
> @alamb FYI

Do you know on which queries we still lag behind the older versions of ClickHouse?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
