alamb opened a new issue, #17207:
URL: https://github.com/apache/datafusion/issues/17207

   ### Is your feature request related to a problem or challenge?
   
   DataFusion makes many IO requests.
   
   Optimizing performance for DataFusion often involves reviewing the IO 
patterns, especially for remote storage like AWS S3, and optimizing first 
requires understanding the pattern. 
   
   We often use `datafusion-cli` to debug code that comes with DataFusion, 
such as the 
[`ListingTable`](https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html),
 even though many (most?) users don't use `datafusion-cli` directly.
   
   @BlakeOrth [sums up the use case 
nicely](https://github.com/apache/datafusion/issues/16365#issuecomment-3189937566)
 here:
   
   > I'm ultimately an API user, not a CLI user. I've been using a 
hacky-instrumented CLI here to help give a common tool and example(s) of 
potential improvements.  
   
   > Exposing additional metrics around where DataFusion is spending its time 
at the API level (and in turn through the CLI) does seem very useful to me 
though. 
   
   > I personally had to rely on a mix of production metrics for our object 
storage, doing off-cpu-time profiling, and the aforementioned hacked in timing 
instrumentation, to help me understand that listing files and collecting their 
object metadata was taking a non-trivial amount of time...
   
   
   
   
   ### Describe the solution you'd like
   
   I would like some way to easily show the object store operations made by 
`datafusion-cli` when some form of profiling is enabled. 
   
   ### Describe alternatives you've considered
   
   ## UX Suggestion
   
   I suggest we take inspiration from @kosiew and 
https://github.com/apache/datafusion/pull/17021 -- specifically something like
   ```sql
   -- Enable profiling. Any subsequent query will run as normal and then
   -- print out an object_store request trace
   \object_store_profiling
   ...
   
   -- Disable profiling (toggle)
   \object_store_profiling
   ```
   
   The output should have relevant information for each object store request. 
Relevant information (maybe there is more):
   * Start time: timestamp (e.g. `2025-01-01T10:20:30`)
   * operation: what object store operation (e.g. `GET`, `LIST`)
   * request ranges: what was requested (e.g. `1000..2000`)
   * response_size: what size object was returned, in bytes (e.g. `1000000` for 
1MB)
   * path: What path was requested
   * duration: how long did the operation take (e.g. `0.500` for 500ms)
   
   ```sql
   > select count(*) from 
'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
   +----------+
   | count(*) |
   +----------+
   | 1000000  |
   +----------+
   1 row(s) fetched.
   Elapsed 3.579 seconds.
   
   Object Store Requests
   2025-01-01T10:20:30 operation=LIST duration=0.050 
path=hits_compatible/athena_partitioned/hits_1.parquet
   2025-01-01T10:20:30 operation=GET duration=0.532 
path=hits_compatible/athena_partitioned/hits_1.parquet ranges="123..456" 
response_size=432342
   
   ```
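   
   As a rough illustration of the fields listed above, one possible in-memory 
representation of a single request is sketched below. The `RequestRecord` type 
and its field names are hypothetical (nothing like it exists in DataFusion 
today), and the start time is kept as a pre-formatted string to avoid assuming 
a particular time library. Its `Display` output follows the example trace 
above.
   
   ```rust
   use std::fmt;
   use std::ops::Range;

   /// One recorded object store request.
   /// Hypothetical type for illustration; not part of DataFusion.
   #[derive(Debug, Clone, PartialEq)]
   pub struct RequestRecord {
       /// Start time, already formatted (e.g. "2025-01-01T10:20:30").
       pub start: String,
       /// Object store operation (e.g. "GET", "LIST").
       pub operation: String,
       /// Path that was requested.
       pub path: String,
       /// How long the operation took, in seconds.
       pub duration_secs: f64,
       /// Byte ranges that were requested (empty for whole-object requests).
       pub ranges: Vec<Range<u64>>,
       /// Size of the response in bytes, if known.
       pub response_size: Option<u64>,
   }

   impl fmt::Display for RequestRecord {
       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
           // Fields that every request has.
           write!(
               f,
               "{} operation={} duration={:.3} path={}",
               self.start, self.operation, self.duration_secs, self.path
           )?;
           // Optional fields are printed only when present.
           if !self.ranges.is_empty() {
               let ranges: Vec<String> = self
                   .ranges
                   .iter()
                   .map(|r| format!("{}..{}", r.start, r.end))
                   .collect();
               write!(f, " ranges=\"{}\"", ranges.join(","))?;
           }
           if let Some(size) = self.response_size {
               write!(f, " response_size={size}")?;
           }
           Ok(())
       }
   }
   ```
   
   Keeping the record as plain data, separate from how it is printed, would 
also leave room for a second, machine-readable rendering later.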
   
   Bonus points (maybe a follow on PR) for a more machine readable format (like 
JSON)
   
   ## Implementation Suggestion
   
   Using the `ObjectStore` API, you can wrap an `ObjectStore` instance, 
intercept all IO requests, and instrument / observe them however you want. For 
example, the `InstrumentedObjectStore` in 
   
   ```rust
   /// A wrapper around an `ObjectStore` that instruments all public methods
   /// with tracing.
   #[derive(Clone, Debug)]
   struct InstrumentedObjectStore {
       inner: Arc<dyn ObjectStore>,
       name: String,
   }
   ```
   
   
https://github.com/datafusion-contrib/datafusion-tracing/blob/8fc214b192ad67114742f8a582292bc06e2b5247/instrumented-object-store/src/instrumented_object_store.rs#L138
  from @geoffreyclaude  and @gabotechs 
   
   I recommend a similar approach in `datafusion-cli` -- when profiling is 
desired, use a wrapped `ObjectStore` that saves the requests for subsequent 
display.
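   
   To make the wrapping idea concrete, here is a minimal sketch using only the 
standard library. It deliberately does not use the real `ObjectStore` trait, 
which is async and has many more methods. `SimpleStore`, `ProfilingStore`, 
`Recorded`, and `ZeroStore` are all hypothetical names invented for this 
example, and a real implementation would need to forward every `ObjectStore` 
method.
   
   ```rust
   use std::ops::Range;
   use std::sync::atomic::{AtomicBool, Ordering};
   use std::sync::{Arc, Mutex};
   use std::time::{Duration, Instant};

   /// Simplified, synchronous stand-in for `ObjectStore` (assumption: the
   /// real trait is async and much larger).
   pub trait SimpleStore: Send + Sync {
       fn get_range(&self, path: &str, range: Range<u64>) -> Vec<u8>;
   }

   /// What the wrapper remembers about each request.
   #[derive(Debug, Clone)]
   pub struct Recorded {
       pub operation: &'static str,
       pub path: String,
       pub range: Range<u64>,
       pub response_size: usize,
       pub duration: Duration,
   }

   /// Wraps any store and saves requests while profiling is enabled.
   pub struct ProfilingStore {
       inner: Arc<dyn SimpleStore>,
       enabled: AtomicBool,
       requests: Mutex<Vec<Recorded>>,
   }

   impl ProfilingStore {
       pub fn new(inner: Arc<dyn SimpleStore>) -> Self {
           Self {
               inner,
               enabled: AtomicBool::new(false),
               requests: Mutex::new(Vec::new()),
           }
       }

       /// Flip profiling on or off and return the new state.
       pub fn toggle(&self) -> bool {
           // `fetch_xor` returns the previous value, so negate it.
           !self.enabled.fetch_xor(true, Ordering::SeqCst)
       }

       /// Drain the saved requests, e.g. to print them after a query.
       pub fn take_requests(&self) -> Vec<Recorded> {
           std::mem::take(&mut *self.requests.lock().unwrap())
       }
   }

   impl SimpleStore for ProfilingStore {
       fn get_range(&self, path: &str, range: Range<u64>) -> Vec<u8> {
           let start = Instant::now();
           // Always forward to the wrapped store; only recording is optional.
           let bytes = self.inner.get_range(path, range.clone());
           if self.enabled.load(Ordering::SeqCst) {
               self.requests.lock().unwrap().push(Recorded {
                   operation: "GET",
                   path: path.to_string(),
                   range,
                   response_size: bytes.len(),
                   duration: start.elapsed(),
               });
           }
           bytes
       }
   }

   /// Toy backend that returns zero-filled bytes of the requested length.
   pub struct ZeroStore;

   impl SimpleStore for ZeroStore {
       fn get_range(&self, _path: &str, range: Range<u64>) -> Vec<u8> {
           vec![0u8; range.end.saturating_sub(range.start) as usize]
       }
   }
   ```
   
   In this sketch the CLI would call `toggle()` when the user enters the 
profiling command and `take_requests()` after each query to print the trace. 
Because the wrapper implements the same trait as the store it wraps, the rest 
of the code does not need to know whether profiling is active.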
   
   ### Additional context
   
   _No response_


