Rachelint commented on PR #11802: URL: https://github.com/apache/datafusion/pull/11802#issuecomment-2271924370
> > @alamb finally I think I got the reason, it seems not the measurement noise for the long queries(such as q22 in clickbench)... > > The introduction for the `Arc<Statistic>` to `PartitionedFile` maybe actually make the long queries slower. The detail can see above, although the use of `Arc` can decrease the `instructions`, but it increase `bus-cycles`, and finally leads to the `higher cycles`(slower). > > I guess it is related to the `atomic` in the `Arc`, and when the amount of `PartitionedFile` becomes large, the cost of `atomic` becomes not trivial. But I am not sure, just a guess. > > I eliminate the `Arc<Statistic>` in `PartitionedFile` now for not hurting the long queries. > > The new benchmarks can see following. > > I find it very strange that `Arc` in statistics should show up at all in the execution times -- I would expect a query that takes seconds to run would not look at the statistics once the query started and I would expect the actual processing time to dominate 🤔 The statistic is actually not used when the execution started, I guess it may be due to the drop of `PartitionedFile` here (`PartitionedFile drop` -> `Arc<Statistic> drop` -> `atomic sub`). https://github.com/apache/datafusion/blob/16a3557325eb8f949f4a87ab90c0a0b174dc8d86/datafusion/core/src/datasource/physical_plan/file_stream.rs#L291-L305 Maybe a possible alternative worth trying in future: we take and drop the `Arc<Statistic>` in `PartitionedFile` before the actual execution? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
