alamb opened a new issue, #17207: URL: https://github.com/apache/datafusion/issues/17207
### Is your feature request related to a problem or challenge? DataFusion makes many IO requests Optimizing performance for DataFusion often involves reviewing the IO patterns, especially for remote storage like AWS S3, and optimizing first requires understanding the pattern. We often use `datafusion-cli` to debug code that comes with DataFusion, such as the [`ListingTable`](https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html) even though many (most?) users don't use `datafusion-cli` directly @BlakeOrth [sums the usecase nicely](https://github.com/apache/datafusion/issues/16365#issuecomment-3189937566) here > I'm ultimately an API user, not a CLI user. I've been using a hacky-instrumented CLI here to help give a common tool and example(s) of potential improvements. > Exposing additional metrics around where DataFusion is spending its time at the API level (and in turn through the CLI) does seem very useful to me though. > I personally had to rely on a mix of production metrics for our object storage, doing off-cpu-time profiling, and the aforementioned hacked in timing instrumentation, to help me understand that listing files and collecting their object metadata was taking a non-trivial amount of time... ### Describe the solution you'd like I would like some way to easily show the object store operations made by datafusion-cli, when some of profiling is enabled. ### Describe alternatives you've considered ## UX Suggestion I suggest we take inspiration from @kosiew and https://github.com/apache/datafusion/pull/17021 -- specifically something like ```sql -- Enable profiling. Any subsequent query will run as normal and then print out a object_store requests trace \object_store_profiling ... -- Disable profiling (toggle) \object_store_profiling ``` The output should have relevant information for each object store request. Relevant information (maybe there is more) * Start time: timestamp e.g 2025-01-01T10:20:30 * operation: what object store operation (e.g. `GET`, `LIST`) * request ranges: what was requested (e.g. `1000..2000`) * response_size: what size object was returned, in bytes (e.g. `1000000` for 1MB) * path: What path was requested * duration: how long did the operation take (e.g. `0.500` for 500ms) ```sql > select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'; +----------+ | count(*) | +----------+ | 1000000 | +----------+ 1 row(s) fetched. Elapsed 3.579 seconds. Object Store Requests 2025-01-01T10:20:30 operation=LIST duration=0.050 path=hits_compatible/athena_partitioned/hits_1.parquet 2025-01-01T10:20:30 operation=GET duration=0.532 path=hits_compatible/athena_partitioned/hits_1.parquet ranges="123..456" response_size=432342 ``` Bonus points (maybe a follow on PR) for a more machine readable format (like JSON) ## Implementation Suggestion Using the ObjectStore API, you can wrap an `ObjectStore` instance and intercept all IO requests and instrument / observe them however you want, for example the `InstrumentedObjectStore` in ```rust /// A wrapper around an `ObjectStore` that instruments all public methods with tracing. #[derive(Clone, Debug)] struct InstrumentedObjectStore { inner: Arc<dyn ObjectStore>, name: String, } ``` https://github.com/datafusion-contrib/datafusion-tracing/blob/8fc214b192ad67114742f8a582292bc06e2b5247/instrumented-object-store/src/instrumented_object_store.rs#L138 from @geoffreyclaude and @gabotechs I recommend a similar approach in `datafusion-cli` -- when profiling is desired, use a wrapped `ObjectStore` that saves the requests for subsequent displayBonu ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org