jizezhang commented on issue #18195: URL: https://github.com/apache/datafusion/issues/18195#issuecomment-3543227177
Thank you. I did take a look at how the file scan operator works. `DataSourceExec` opens a `FileStream` that, when polled, internally calls a file opener to open a file, e.g. `ParquetOpener::open`. It seems to me that the majority of the file-reading logic lives inside the future returned by `open`. Metadata appears to be loaded during physical planning for `TableScan`, which involves collecting statistics from the metadata, and is then cached. `ParquetOpener::open` returns a `ParquetRecordBatchStream`, and decoding of the payload happens when the stream is polled (also inside `FileStream::poll_inner`)?

In terms of tracking elapsed compute time, do we want to create a `BaselineMetrics` instance and track it inside `ParquetOpener::open`? But for decoding, how/where would we track that? It looks like we currently copy metrics from `ArrowReaderMetrics` (https://github.com/apache/datafusion/blob/af2233675dbe8821cf388a5366e25268295ce034/datafusion/datasource-parquet/src/opener.rs#L485), which does not seem to track elapsed compute time. But I could have misunderstood, so please let me know your thoughts.

Another question: when I tried a file scan with CSV, I also got an extremely small elapsed compute time, in `ns`. Is it expected to be that small for file formats other than Parquet, or is the metric probably not tracked for file scans in general?
