BlakeOrth commented on issue #17207:
URL: https://github.com/apache/datafusion/issues/17207#issuecomment-3192198520

   As I was reading through this, I was somewhat concerned about how easy it 
might be to accidentally miss calls that should be instrumented. However, I 
think the implementation suggestion completely resolves those concerns.
   
   I do have one small concern here: the output may end up being very verbose, 
potentially to the point of being useless for human consumption in certain 
situations. One example is a highly selective filter operating on string 
(binary) data for a large parquet dataset, where the quantity of GET requests 
needed to fetch parquet pages could be overwhelming on its own. I don't want 
to let "perfect" be the enemy of "good," so perhaps this is better suited to a 
follow-on PR, but I'm wondering if summary statistics might end up being 
generally more useful than individual request data.
   
   Leveraging the provided example a bit:
   ```sql
   > select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
   +----------+
   | count(*) |
   +----------+
   | 1000000  |
   +----------+
   1 row(s) fetched.
   Elapsed 3.579 seconds.
   
   -- This might format nicely as a table as well
   Object Store Statistics:
   GET:
     count: 5
     total_duration (ms): 1952
     min (ms): 85
     max (ms): 742
   HEAD:
     count: 57
     total_duration (ms): 529
     min (ms): 14
     max (ms): 75
   
   -- Perhaps this can be a "trace" mode?
   Object Store Requests
   2025-01-01T10:20:30 operation=LIST duration=0.050 path=hits_compatible/athena_partitioned/hits_1.parquet
   2025-01-01T10:20:30 operation=GET duration=0.532 path=hits_compatible/athena_partitioned/hits_1.parquet ranges="123..456" response_size=432342
   ```
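
To make the summary idea a bit more concrete, here is a minimal sketch of how per-request records could be folded into the per-operation statistics shown above. It is written in Python purely for illustration and is not DataFusion code: `RequestRecord`, `OperationSummary`, `summarize`, and `format_summary` are hypothetical names, and it assumes each instrumented call reports an operation name, a duration in milliseconds, and a path.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class RequestRecord:
    """One instrumented object store call (hypothetical shape)."""
    operation: str    # e.g. "GET", "HEAD", "LIST"
    duration_ms: int  # wall-clock duration of the request, in milliseconds
    path: str         # object path the request targeted


@dataclass
class OperationSummary:
    """Running statistics for a single operation type."""
    count: int = 0
    total_ms: int = 0
    min_ms: Optional[int] = None
    max_ms: Optional[int] = None

    def add(self, duration_ms: int) -> None:
        # Fold one request into the running count, total, min, and max.
        self.count += 1
        self.total_ms += duration_ms
        self.min_ms = duration_ms if self.min_ms is None else min(self.min_ms, duration_ms)
        self.max_ms = duration_ms if self.max_ms is None else max(self.max_ms, duration_ms)


def summarize(records: Iterable[RequestRecord]) -> Dict[str, OperationSummary]:
    """Group records by operation and aggregate their durations."""
    summaries: Dict[str, OperationSummary] = defaultdict(OperationSummary)
    for record in records:
        summaries[record.operation].add(record.duration_ms)
    return dict(summaries)


def format_summary(summaries: Dict[str, OperationSummary]) -> str:
    """Render the summaries in the layout used in the example above."""
    lines = ["Object Store Statistics:"]
    for operation in sorted(summaries):
        s = summaries[operation]
        lines += [
            f"{operation}:",
            f"  count: {s.count}",
            f"  total_duration (ms): {s.total_ms}",
            f"  min (ms): {s.min_ms}",
            f"  max (ms): {s.max_ms}",
        ]
    return "\n".join(lines)
```

Because only four numbers are kept per operation, the summary stays the same size no matter how many requests a query issues. A "trace" mode would then simply retain the individual records instead of discarding them after aggregation.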


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

