steveloughran commented on PR #1187: URL: https://github.com/apache/parquet-mr/pull/1187#issuecomment-1816906884
It'd be really nice if somehow there was a way to push hadoop stream IOStats here, especially the counters, min, max and mean maps: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/iostatistics.html

It's really interesting for the s3, azure and gcs clients, where we collect stream-specific stuff, including things like: bytes discarded in seek, time for GET, whether we did a HEAD first, and more. These are collected at the thread level, but also include stats from helper threads such as those in async stream draining, vector IO...

It'd take a move to hadoop 3.3.1+ to embrace the API, but if there was a way for something to publish stats to your metric collector, then maybe something could be done.

Tip: you can enable a dump of a filesystem's aggregate stats at process shutdown for azure and s3a:

```
fs.iostatistics.logging.level=info
```

```
2023-11-17 18:30:28,634 [shutdown-hook-0] INFO statistics.IOStatisticsLogging (IOStatisticsLogging.java:logIOStatisticsAtLevel(269)) - IOStatistics: counters=((action_http_head_request=3) (audit_request_execution=15) (audit_span_creation=12) (object_list_request=12) (object_metadata_request=3) (op_get_file_status=1) (op_glob_status=1) (op_list_status=9) (store_io_request=15)); gauges=(); minimums=((action_http_head_request.min=22) (object_list_request.min=25) (op_get_file_status.min=1) (op_glob_status.min=9) (op_list_status.min=25)); maximums=((action_http_head_request.max=41) (object_list_request.max=398) (op_get_file_status.max=1) (op_glob_status.max=9) (op_list_status.max=408)); means=((action_http_head_request.mean=(samples=3, sum=87, mean=29.0000)) (object_list_request.mean=(samples=12, sum=708, mean=59.0000)) (op_get_file_status.mean=(samples=1, sum=1, mean=1.0000)) (op_glob_status.mean=(samples=1, sum=9, mean=9.0000)) (op_list_status.mean=(samples=9, sum=814, mean=90.4444)));
```
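As a rough illustration of the "publish stats to your metric collector" idea, here is a minimal sketch. `IOStatsBridge`, `MetricSink` and the `s3a.` prefix are hypothetical names invented for this example; they are not parquet-mr or Hadoop APIs. The sketch takes plain maps so that it runs without Hadoop on the classpath. On hadoop 3.3.1+ the maps would presumably come from `IOStatistics.counters()`, `minimums()` and `maximums()` on whatever `IOStatisticsSupport.retrieveIOStatistics(stream)` returns (which can be null if the stream doesn't offer stats). The mean map is left out to keep it short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bridge: forwards IOStatistics-style maps to a metric collector. */
final class IOStatsBridge {

    /** Hypothetical sink; stands in for whatever metric collector is in use. */
    interface MetricSink {
        void record(String name, long value);
    }

    /**
     * Publishes counters, minimums and maximums under a common prefix.
     * The three maps have the same shape as those exposed by hadoop's
     * IOStatistics interface (statistic name to long value).
     */
    static void publish(String prefix,
                        Map<String, Long> counters,
                        Map<String, Long> minimums,
                        Map<String, Long> maximums,
                        MetricSink sink) {
        counters.forEach((name, value) -> sink.record(prefix + name, value));
        minimums.forEach((name, value) -> sink.record(prefix + name, value));
        maximums.forEach((name, value) -> sink.record(prefix + name, value));
    }

    public static void main(String[] args) {
        // A few values copied from the log excerpt in the comment.
        Map<String, Long> counters = new LinkedHashMap<>();
        counters.put("action_http_head_request", 3L);
        counters.put("object_list_request", 12L);

        Map<String, Long> minimums = new LinkedHashMap<>();
        minimums.put("object_list_request.min", 25L);

        Map<String, Long> maximums = new LinkedHashMap<>();
        maximums.put("object_list_request.max", 398L);

        // The sink here just prints; a real one would feed the collector.
        publish("s3a.", counters, minimums, maximums,
                (name, value) -> System.out.println(name + "=" + value));
    }
}
```

Taking maps rather than the `IOStatistics` object itself keeps the collector side free of any hadoop dependency; only the code that pulls the maps off the stream would need 3.3.1+.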