2010YOUY01 commented on issue #18195:
URL: https://github.com/apache/datafusion/issues/18195#issuecomment-3544934902

   > Thank you. I did take a look at how the file scan operator works. 
`DataSourceExec` opens a `FileStream` that, when polled, internally calls a 
file opener to open a file, e.g. `ParquetOpener::open`. It seems to me that 
the majority of the file-reading logic is inside the future returned by `open`. 
Metadata seems to be loaded during physical planning for `TableScan`, which 
involves collecting statistics from metadata, and is then cached. 
`ParquetOpener::open` returns a `ParquetRecordBatchStream`, and decoding of the 
payload happens when the stream is polled (also inside `FileStream::poll_inner`)?
   > 
   > In terms of tracking elapsed compute time, do we want to create a 
`BaselineMetrics` instance and track inside `ParquetOpener::open`? But for 
decoding, how/where would we track that? It looks like we currently copy 
metrics from `ArrowReaderMetrics` 
https://github.com/apache/arrow-rs/blob/ca4a0ae5e4122e905686f3b7538b5308503cb770/parquet/src/arrow/arrow_reader/metrics.rs#L40
 which does not seem to track elapsed compute time. But I could have 
misunderstood, so please let me know your thoughts.
   
   This high-level idea makes sense to me, but I'm not sure about the details 
yet; it's on my list to look into later.
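   
   For illustration only, here is a minimal sketch of the general technique 
being discussed: wrap whatever produces batches and accumulate the time spent 
inside each pull. It uses only the Rust standard library and a synchronous 
iterator as a stand-in for a polled stream. `ElapsedNanos` and `TimedIter` are 
hypothetical names, not DataFusion or arrow-rs APIs, and where such a wrapper 
would actually belong is exactly the open question above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Instant;

/// Hypothetical stand-in for a shared elapsed-time metric, in nanoseconds.
/// Cloning shares the same underlying counter.
#[derive(Clone, Default)]
pub struct ElapsedNanos(Arc<AtomicU64>);

impl ElapsedNanos {
    /// Add `nanos` to the running total.
    pub fn add(&self, nanos: u64) {
        self.0.fetch_add(nanos, Ordering::Relaxed);
    }

    /// Read the accumulated total.
    pub fn value(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical wrapper that times every call to `next()` on the inner
/// iterator and records the duration in `elapsed`.
pub struct TimedIter<I> {
    inner: I,
    elapsed: ElapsedNanos,
}

impl<I> TimedIter<I> {
    pub fn new(inner: I, elapsed: ElapsedNanos) -> Self {
        Self { inner, elapsed }
    }
}

impl<I: Iterator> Iterator for TimedIter<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // Measure only the time spent producing the next item.
        let start = Instant::now();
        let item = self.inner.next();
        self.elapsed.add(start.elapsed().as_nanos() as u64);
        item
    }
}
```

   With an async stream, the analogous spot would be around the inner 
`poll_next` call, so that time spent waiting between polls is not counted.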
   
   > Another question I have is that when I tried a file scan with CSV, I also 
got an extremely small elapsed compute time in `ns`. Is it expected to be that 
small for file formats other than Parquet, or is the metric probably not 
tracked for file scans in general?
   
   Yes, this is not reasonable; thank you for the investigation. Do you have 
time to file an issue like this one? If not, I'm happy to do so.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

