[ https://issues.apache.org/jira/browse/PARQUET-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787346#comment-17787346 ]
ASF GitHub Bot commented on PARQUET-2374:
-----------------------------------------

steveloughran commented on PR #1187:
URL: https://github.com/apache/parquet-mr/pull/1187#issuecomment-1816906884

It'd be really nice if somehow there was a way to push Hadoop stream IOStatistics here, especially the counters, min, max and mean maps: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/iostatistics.html

It's really interesting for the S3, Azure and GCS clients, where we collect stream-specific stuff, including things like: bytes discarded in seek, time for GET, whether we did a HEAD first, and more. These are collected at a thread level, but also include stats from helper threads such as those in async stream draining, vector IO...

It'd take a move to Hadoop 3.3.1+ to embrace the API, but if there was a way for something to publish stats to your metric collector, then maybe something could be done. (A sketch of reading the stats off a stream with that API is appended after the quoted issue below.)

Tip: you can enable a dump of a filesystem's aggregate stats at process shutdown for Azure and S3A:

```
fs.iostatistics.logging.level=info
```

```
2023-11-17 18:30:28,634 [shutdown-hook-0] INFO statistics.IOStatisticsLogging (IOStatisticsLogging.java:logIOStatisticsAtLevel(269)) - IOStatistics:
counters=((action_http_head_request=3) (audit_request_execution=15) (audit_span_creation=12) (object_list_request=12) (object_metadata_request=3) (op_get_file_status=1) (op_glob_status=1) (op_list_status=9) (store_io_request=15));
gauges=();
minimums=((action_http_head_request.min=22) (object_list_request.min=25) (op_get_file_status.min=1) (op_glob_status.min=9) (op_list_status.min=25));
maximums=((action_http_head_request.max=41) (object_list_request.max=398) (op_get_file_status.max=1) (op_glob_status.max=9) (op_list_status.max=408));
means=((action_http_head_request.mean=(samples=3, sum=87, mean=29.0000)) (object_list_request.mean=(samples=12, sum=708, mean=59.0000)) (op_get_file_status.mean=(samples=1, sum=1, mean=1.0000)) (op_glob_status.mean=(samples=1, sum=9, mean=9.0000)) (op_list_status.mean=(samples=9, sum=814, mean=90.4444)));
```

> Add metrics support for parquet file reader
> -------------------------------------------
>
>                 Key: PARQUET-2374
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2374
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.13.1
>            Reporter: Parth Chandra
>            Priority: Major
>
> ParquetFileReader is used by many engines, Hadoop and Spark among them. These
> engines report various metrics to measure performance in different
> environments, and it is usually useful to be able to get low-level metrics out
> of the file readers and writers.
> It would be very useful to allow a simple interface to report the metrics.
> Callers can then implement the interface to record the metrics in any
> subsystem they choose.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
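
As a reference for the comment above, here is a minimal sketch of pulling stream IOStatistics with the Hadoop 3.3.1+ API and handing them to a caller-supplied collector. The `MetricsSink` interface and `StreamStatsPublisher` class are hypothetical placeholders for whatever the calling engine provides; only the `org.apache.hadoop.fs.statistics` calls are real API.

```java
import java.util.Map;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.statistics.IOStatistics;
import org.apache.hadoop.fs.statistics.IOStatisticsSupport;
import org.apache.hadoop.fs.statistics.MeanStatistic;

public final class StreamStatsPublisher {

  /** Hypothetical callback; stands in for the engine's own metric collector. */
  public interface MetricsSink {
    void counter(String name, long value);
    void minimum(String name, long value);
    void maximum(String name, long value);
    void mean(String name, long samples, double mean);
  }

  /** Copy whatever IOStatistics the stream exposes into the sink. */
  public static void publish(FSDataInputStream in, MetricsSink sink) {
    // Returns null on older Hadoop releases or streams without statistics.
    IOStatistics stats = IOStatisticsSupport.retrieveIOStatistics(in);
    if (stats == null) {
      return;
    }
    stats.counters().forEach(sink::counter);
    stats.minimums().forEach(sink::minimum);
    stats.maximums().forEach(sink::maximum);
    for (Map.Entry<String, MeanStatistic> e : stats.meanStatistics().entrySet()) {
      sink.mean(e.getKey(), e.getValue().getSamples(), e.getValue().mean());
    }
  }
}
```

The published names would be the same statistic keys shown in the shutdown dump above (action_http_head_request, object_list_request, ...), so an engine could surface them alongside its own reader metrics.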
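On the ticket's ask itself, the "simple interface" could be as small as the following. This is an illustrative shape only, assuming counter- and duration-style metrics; the names are hypothetical and not the API actually added by PR #1187.

```java
/**
 * Illustrative only: one possible shape for the reporting hook the ticket
 * asks for. An engine such as Spark or Hadoop would implement it once and
 * forward the values into its own metrics subsystem.
 */
public interface ParquetReaderMetricsCallback {

  /** Report an additive count, e.g. bytes read or pages decoded. */
  void addCounter(String name, long value);

  /** Report the duration of a timed operation, e.g. a column chunk read. */
  void addDuration(String name, long nanos);
}
```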