parthchandra commented on PR #1187: URL: https://github.com/apache/parquet-mr/pull/1187#issuecomment-1816986727
@steveloughran I did look into leveraging the Hadoop IO statistics, but my first attempt did not work too well, and I thought a simpler initial implementation would be more useful. Once we move to the Hadoop vectored IO, I'll take another stab at it.

> What would be good if this stats was set up to
>
> take maps of key-value rather than a fixed enum

The fixed enum here is simply the Parquet file reader declaring the values it knows about. This implementation is not really collecting or aggregating anything; it only records the times and counts and passes them on.

> collect those min/mean/max as well as counts.

The implementation of the Parquet metrics callback will do that. If the execution engine is Spark, for example, it can simply take the values and add them to its own metrics collection subsystem, which then computes the min/max/mean.

> somehow provided a plugin point where we could add something to add any of the parquet reader/writer stats to the thread context - trying to collect stats from inside wrapped-many-times-over streams and iterators is way too complex. I know, I have a branch of parquet where I tried that...

Hmm, that will take some work. I wanted to measure streaming decompression time (where the `decompress` call simply returns a stream which is decompressed as it is read), but found it required too many breaking changes to implement. A standard mechanism like `IOStatistics`, where such a stream is an `IOStatisticsSource`, would be perfect.
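To make the enum-plus-callback split concrete, here is a minimal standalone sketch of the pattern described above: the reader declares a fixed enum of metrics it knows about and pushes raw values through a callback, while the engine-side callback implementation does all the min/mean/max aggregation. All names here (`ReaderMetric`, `MetricsCallback`, `AggregatingCallback`) are hypothetical, not the actual parquet-mr API.

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative sketch only: the reader side declares WHAT it can report
// (the fixed enum) and reports raw values; the engine side decides HOW
// to aggregate them (count/min/max/mean).
class MetricsCallbackSketch {

  // The fixed enum: the reader's declaration of the values it knows about.
  enum ReaderMetric { READ_TIME_NANOS, DECOMPRESS_TIME_NANOS, BYTES_READ }

  // The callback an execution engine would implement.
  interface MetricsCallback {
    void add(ReaderMetric metric, long value);
  }

  // Engine-side implementation (e.g. what a Spark integration might do):
  // aggregates into count/sum/min/max so mean can be derived on demand.
  static class AggregatingCallback implements MetricsCallback {
    // stats value layout: {count, sum, min, max}
    private final Map<ReaderMetric, long[]> stats = new EnumMap<>(ReaderMetric.class);

    @Override
    public synchronized void add(ReaderMetric m, long v) {
      long[] s = stats.computeIfAbsent(
          m, k -> new long[] {0L, 0L, Long.MAX_VALUE, Long.MIN_VALUE});
      s[0]++;
      s[1] += v;
      s[2] = Math.min(s[2], v);
      s[3] = Math.max(s[3], v);
    }

    synchronized long count(ReaderMetric m) { return stats.get(m)[0]; }
    synchronized long min(ReaderMetric m)   { return stats.get(m)[2]; }
    synchronized long max(ReaderMetric m)   { return stats.get(m)[3]; }
    synchronized double mean(ReaderMetric m) {
      long[] s = stats.get(m);
      return (double) s[1] / s[0];
    }
  }
}
```

The point of the split is that the reader never needs to know the engine's metrics model; it only needs a stable set of keys and a fire-and-forget `add` call.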
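On the streaming-decompression point: the attraction of the `IOStatistics` approach is that a wrapping stream can expose its own statistics, so the time spent decompressing as the stream is read can be observed without changing the `decompress` signature. The sketch below illustrates that shape with a standalone `StatsSource` interface standing in for Hadoop's `IOStatisticsSource`; it is an assumption-laden illustration, not the real Hadoop API.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch only: a stream wrapper that times its own reads and
// publishes the totals through a statistics interface, in the spirit of
// a stream that is also an IOStatisticsSource.
class TimedStreamSketch {

  // Stand-in for IOStatisticsSource; callers probe the stream for stats.
  interface StatsSource {
    long totalReadNanos();
    long totalBytesRead();
  }

  static class TimedInputStream extends FilterInputStream implements StatsSource {
    private long nanos;
    private long bytes;

    TimedInputStream(InputStream in) { super(in); }

    @Override
    public int read() throws IOException {
      long t0 = System.nanoTime();
      int b = super.read();
      nanos += System.nanoTime() - t0;
      if (b >= 0) bytes++;
      return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
      long t0 = System.nanoTime();
      int n = super.read(buf, off, len);
      nanos += System.nanoTime() - t0;
      if (n > 0) bytes += n;
      return n;
    }

    @Override public long totalReadNanos() { return nanos; }
    @Override public long totalBytesRead() { return bytes; }
  }
}
```

If the stream returned by `decompress` were wrapped this way, the caller could check `instanceof StatsSource` after draining it and harvest the timing, with no change to the method's return type.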