Re: [PR] Reduce clone of `Statistics` in `ListingTable` and `PartitionedFile` [datafusion]

via GitHub Tue, 06 Aug 2024 11:44:50 -0700


Rachelint commented on PR #11802:
URL: https://github.com/apache/datafusion/pull/11802#issuecomment-2271924370


   > > @alamb finally I think I got the reason, it seems not the measurement 
noise for the long queries(such as q22 in clickbench)...
   > > The introduction for the `Arc<Statistic>` to `PartitionedFile` maybe 
actually make the long queries slower. The detail can see above, although the 
use of `Arc` can decrease the `instructions`, but it increase `bus-cycles`, and 
finally leads to the `higher cycles`(slower).
   > > I guess it is related to the `atomic` in the `Arc`, and when the amount 
of `PartitionedFile` becomes large, the cost of `atomic` becomes not trivial. 
But I am not sure, just a guess.
   > > I eliminate the `Arc<Statistic>` in `PartitionedFile` now for not 
hurting the long queries.
   > > The new benchmarks can see following.
   > 
   > I find it very strange that `Arc` in statistics should show up at all in 
the execution times -- I would expect a query that takes seconds to run would 
not look at the statistics once the query started and I would expect the actual 
processing time to dominate 🤔
   
   The statistic is actually not used when the execution started, I guess it 
may be due to the drop of `PartitionedFile` here (`PartitionedFile drop` -> 
`Arc<Statistic> drop` -> `atomic sub`).
   
   
https://github.com/apache/datafusion/blob/16a3557325eb8f949f4a87ab90c0a0b174dc8d86/datafusion/core/src/datasource/physical_plan/file_stream.rs#L291-L305
   
   Maybe a possible alternative worth trying in future: we take and drop the 
`Arc<Statistic>` in `PartitionedFile` before the actual execution?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Reduce clone of `Statistics` in `ListingTable` and `PartitionedFile` [datafusion]

Reply via email to