[ 
https://issues.apache.org/jira/browse/PARQUET-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787346#comment-17787346
 ] 

ASF GitHub Bot commented on PARQUET-2374:
-----------------------------------------

steveloughran commented on PR #1187:
URL: https://github.com/apache/parquet-mr/pull/1187#issuecomment-1816906884

   It'd be really nice if there were a way to push Hadoop stream IOStatistics
here, especially the counter, min, max and mean maps:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/iostatistics.html
   
   This is really interesting for the S3, Azure and GCS clients, where we
collect stream-specific statistics, including bytes discarded in seek, time for
GET, whether we did a HEAD first, and more. These are collected at the thread
level, but also include stats from helper threads such as those doing async
stream draining, vector IO...
   
   It'd take a move to Hadoop 3.3.1+ to embrace the API, but if there were a way
for something to publish stats to your metrics collector, then maybe something
could be done.
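   
   A rough sketch of what such a bridge could look like, under some loud 
assumptions: `MetricsCallback` and `IOStatsBridge` are hypothetical names, not 
part of parquet-mr or Hadoop, and the statistics arrive as a plain 
`Map<String, Long>` so the example has no Hadoop dependency. On Hadoop 3.3.1+ 
I'd expect that map to come from something like 
`IOStatisticsSupport.retrieveIOStatistics(stream)` followed by `counters()`, 
`minimums()` or `maximums()`.
   
   ```java
   import java.util.Map;

   /** Hypothetical bridge from stream statistics to a caller-supplied collector. */
   final class IOStatsBridge {

       /** Hypothetical callback a reader could accept; not an existing API. */
       interface MetricsCallback {
           void setValue(String name, long value);
       }

       private IOStatsBridge() {
       }

       /**
        * Publish every entry of a statistics map under the given prefix.
        * Returns the number of values published; a null map (for example a
        * stream that offers no statistics) publishes nothing.
        */
       static int publish(String prefix, Map<String, Long> stats, MetricsCallback callback) {
           if (stats == null || callback == null) {
               return 0;
           }
           int published = 0;
           for (Map.Entry<String, Long> entry : stats.entrySet()) {
               callback.setValue(prefix + entry.getKey(), entry.getValue());
               published++;
           }
           return published;
       }
   }
   ```
   
   The prefix is only there so that counters from different stores or streams 
do not collide in the collector; the real naming scheme would be up to the 
engine.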
   
   Tip: for Azure and S3A you can enable a dump of a filesystem's aggregate
stats at process shutdown by setting:
   ```
   fs.iostatistics.logging.level=info
   ```
   
   ```
   2023-11-17 18:30:28,634 [shutdown-hook-0] INFO  statistics.IOStatisticsLogging (IOStatisticsLogging.java:logIOStatisticsAtLevel(269)) - IOStatistics: counters=((action_http_head_request=3)
   (audit_request_execution=15)
   (audit_span_creation=12)
   (object_list_request=12)
   (object_metadata_request=3)
   (op_get_file_status=1)
   (op_glob_status=1)
   (op_list_status=9)
   (store_io_request=15));
   
   gauges=();
   
   minimums=((action_http_head_request.min=22)
   (object_list_request.min=25)
   (op_get_file_status.min=1)
   (op_glob_status.min=9)
   (op_list_status.min=25));
   
   maximums=((action_http_head_request.max=41)
   (object_list_request.max=398)
   (op_get_file_status.max=1)
   (op_glob_status.max=9)
   (op_list_status.max=408));
   
   means=((action_http_head_request.mean=(samples=3, sum=87, mean=29.0000))
   (object_list_request.mean=(samples=12, sum=708, mean=59.0000))
   (op_get_file_status.mean=(samples=1, sum=1, mean=1.0000))
   (op_glob_status.mean=(samples=1, sum=9, mean=9.0000))
   (op_list_status.mean=(samples=9, sum=814, mean=90.4444)));
   ```
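   
   Note that each entry in `means` carries `samples` and `sum` as well as the 
mean. As I understand it, that is what lets stats from several threads or 
streams be merged without losing accuracy. A minimal sketch of the idea, using 
a hypothetical `MeanStat` class (Hadoop has its own `MeanStatistic` type; this 
is not its implementation):
   
   ```java
   /** Hypothetical mean tracker: keeps samples and sum so means can be merged. */
   final class MeanStat {
       final long samples;
       final long sum;

       MeanStat(long samples, long sum) {
           this.samples = samples;
           this.sum = sum;
       }

       /** Merge two trackers by adding sample counts and sums. */
       MeanStat add(MeanStat other) {
           return new MeanStat(samples + other.samples, sum + other.sum);
       }

       /** Mean of all samples seen; 0.0 when there are none. */
       double mean() {
           return samples == 0 ? 0.0 : (double) sum / samples;
       }
   }
   ```
   
   With the values above, `new MeanStat(9, 814).mean()` gives the 90.4444 
reported for `op_list_status`; averaging already-computed means from two 
sources would only be correct if both had the same number of samples.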
   




> Add metrics support for parquet file reader
> -------------------------------------------
>
>                 Key: PARQUET-2374
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2374
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.13.1
>            Reporter: Parth Chandra
>            Priority: Major
>
> ParquetFileReader is used by many engines, Hadoop and Spark among them. These
> engines report various metrics to measure performance in different
> environments, and it is often useful to be able to get low-level metrics out
> of the file readers and writers.
> It would be very useful to provide a simple interface for reporting these
> metrics. Callers can then implement the interface to record the metrics in
> any subsystem they choose.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
