jizezhang commented on issue #18195: URL: https://github.com/apache/datafusion/issues/18195#issuecomment-3543227177
Thank you. I did take a look at how the file scan operator works. `DataSourceExec` opens a `FileStream` that, when polled, internally calls a file opener to open a file, e.g. `ParquetOpener::open`. It seems to me that the majority of the file-reading logic lives inside the future returned by `open`. Metadata appears to be loaded during physical planning for `TableScan`, which involves collecting statistics from the metadata, and is then cached. `ParquetOpener::open` returns a `ParquetRecordBatchStream`, and decoding of the payload happens when the stream is polled (also inside `FileStream::poll_inner`)?

In terms of tracking elapsed compute time, do we want to create a `BaselineMetrics` instance and track it inside `ParquetOpener::open`? But for decoding, how/where would we track that? It looks like we currently copy metrics from `ArrowReaderMetrics` (https://github.com/apache/datafusion/blob/af2233675dbe8821cf388a5366e25268295ce034/datafusion/datasource-parquet/src/opener.rs#L485), which does not seem to track elapsed compute time. But I could have misunderstood, so please let me know your thoughts.

Another question: when I tried a file scan with CSV, I also got an extremely small elapsed compute time, in `ns`. Is it expected to be that small for file formats other than Parquet, or is the metric probably not tracked for file scans in general?
