parthchandra commented on PR #1187:
URL: https://github.com/apache/parquet-mr/pull/1187#issuecomment-1816986727

   @steveloughran I did look into leveraging the Hadoop IO statistics, but my first 
attempt did not work well and I thought a simpler initial implementation 
would be more useful. Once we move to Hadoop vector IO, I'll take another stab 
at it. 
   
   > What would be good if this stats was set up to
   > 
   > take maps of key-value rather than a fixed enum
   
   The fixed enum here is simply the Parquet file reader declaring the set of 
values it knows about. This implementation does not collect or aggregate 
anything; it simply records the times and counts and passes them on. 
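
   The shape described above can be sketched roughly as follows. All names here (`ReaderMetric`, `MetricsCallback`, `LoggingCallback`) are illustrative stand-ins, not the actual parquet-mr API from this PR:

   ```java
   // Hypothetical sketch: the reader reports raw durations/counts against a
   // fixed set of metrics it knows about, and does no aggregation itself.
   enum ReaderMetric { READ_TIME, DECOMPRESS_TIME, DECODE_TIME, PAGE_COUNT }

   interface MetricsCallback {
       // Called by the reader with a single raw observation.
       void record(ReaderMetric metric, long value);
   }

   // Trivial consumer: just forwards each observation.
   class LoggingCallback implements MetricsCallback {
       @Override
       public void record(ReaderMetric metric, long value) {
           System.out.println(metric + "=" + value);
       }
   }

   public class Demo {
       public static void main(String[] args) {
           MetricsCallback cb = new LoggingCallback();
           long start = System.nanoTime();
           // ... read a page here ...
           cb.record(ReaderMetric.READ_TIME, System.nanoTime() - start);
           cb.record(ReaderMetric.PAGE_COUNT, 1);
       }
   }
   ```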
    
   > collect those min/mean/max as well as counts.
   
   The implementation of the Parquet metrics callback will do that. If the 
execution engine is Spark, for example, it can simply get the values and add 
them to its own metrics collection subsystem, which then computes the 
min/max/mean.
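
   An engine-side aggregation like that could look something like the sketch below. It keys metrics by name and leans on `java.util.LongSummaryStatistics` for min/mean/max; the class and method names are hypothetical:

   ```java
   import java.util.HashMap;
   import java.util.LongSummaryStatistics;
   import java.util.Map;

   // Illustrative engine-side aggregator: the reader-side callback only hands
   // over raw values; min/mean/max fall out of a per-metric summary.
   class MetricAggregator {
       private final Map<String, LongSummaryStatistics> stats = new HashMap<>();

       // Invoked with each raw observation reported by the reader.
       void record(String metric, long value) {
           stats.computeIfAbsent(metric, k -> new LongSummaryStatistics())
                .accept(value);
       }

       LongSummaryStatistics summary(String metric) {
           return stats.getOrDefault(metric, new LongSummaryStatistics());
       }
   }

   public class AggDemo {
       public static void main(String[] args) {
           MetricAggregator agg = new MetricAggregator();
           agg.record("readTimeNanos", 100);
           agg.record("readTimeNanos", 300);
           LongSummaryStatistics s = agg.summary("readTimeNanos");
           System.out.println(s.getMin() + " " + s.getMax() + " " + s.getAverage());
           // prints: 100 300 200.0
       }
   }
   ```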
   
   > somehow provided a plugin point where we could add something to add any of 
the parquet reader/writer stats to the thread context -trying to collect stats 
from inside wrapped-many-times-over streams and iterators is way too complex. I 
know, i have a branch of parquet where I tried that...
   
   Hmm, that will take some work. I wanted to measure streaming decompression 
time (where the `decompress` call simply returns a stream which is decompressed 
as it is read), but found it required too many breaking changes to implement. 
But a standard system like `IOStatistics`, where such a stream is an 
`IOStatisticsSource`, would be perfect. 
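
   To make that idea concrete: a stream that carries its own counters, which a caller can query after any amount of wrapping. `StatsSource` below is a local stand-in for Hadoop's `IOStatisticsSource` (whose real `getIOStatistics()` returns a richer `IOStatistics` object); the rest is an illustrative sketch, not code from this PR:

   ```java
   import java.io.FilterInputStream;
   import java.io.IOException;
   import java.io.InputStream;
   import java.util.concurrent.atomic.AtomicLong;

   // Stand-in for Hadoop's IOStatisticsSource, simplified to two counters.
   interface StatsSource {
       long bytesRead();
       long readTimeNanos();
   }

   // A wrapping stream that times and counts its own reads, so stats survive
   // however many layers of wrapping sit above it.
   class TimedInputStream extends FilterInputStream implements StatsSource {
       private final AtomicLong bytes = new AtomicLong();
       private final AtomicLong nanos = new AtomicLong();

       TimedInputStream(InputStream in) { super(in); }

       @Override
       public int read(byte[] b, int off, int len) throws IOException {
           long t0 = System.nanoTime();
           int n = super.read(b, off, len);
           nanos.addAndGet(System.nanoTime() - t0);
           if (n > 0) bytes.addAndGet(n);
           return n;
       }

       @Override
       public int read() throws IOException {
           // Route single-byte reads through the counted path.
           byte[] b = new byte[1];
           int n = read(b, 0, 1);
           return n == -1 ? -1 : b[0] & 0xff;
       }

       @Override public long bytesRead() { return bytes.get(); }
       @Override public long readTimeNanos() { return nanos.get(); }
   }

   public class StreamDemo {
       public static void main(String[] args) throws IOException {
           TimedInputStream s = new TimedInputStream(
               new java.io.ByteArrayInputStream("hello".getBytes()));
           s.readAllBytes();
           System.out.println(s.bytesRead()); // prints: 5
       }
   }
   ```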
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
