parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136528026

   > Latency is the killer; in an HTTP request you want to read enough but not 
discard data or break an HTTP connection if the client suddenly does a seek() 
or readFully() somewhere else. File listings, existence checks etc.
   > 
   > That'd be great. now, if you could also handle requesting different 
columns in parallel and processing them out of order.
   
   I do. The Parquet file reader API that reads row groups reads all columns 
in sequence in sync mode. In async mode, it fires off a task for every column, 
blocking only to read the first page of each column before returning. This 
part also uses a different thread pool from the IO tasks, so that IO tasks never 
wait for lack of available threads in the pool.
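
   A rough sketch of the async mode described above, for illustration only. 
It is not the code in this PR: `AsyncColumnReadSketch`, `ColumnPages` and 
`readRowGroupAsync` are hypothetical names, pages are plain strings, and the 
"IO" is simulated. It shows one task per column, a separate pool for 
processing, and a caller that blocks only until the first page of every 
column is ready.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncColumnReadSketch {

    /** Hypothetical stand-in for a column reader: decoded pages arrive on a queue. */
    static final class ColumnPages {
        final String column;
        final BlockingQueue<String> pages = new LinkedBlockingQueue<>();
        final CountDownLatch firstPageReady = new CountDownLatch(1);

        ColumnPages(String column) {
            this.column = column;
        }
    }

    /**
     * Starts one IO task and one processing task per column, then blocks only
     * until every column has its first page. Assumes pagesPerColumn >= 1.
     */
    static List<ColumnPages> readRowGroupAsync(List<String> columns, int pagesPerColumn,
            ExecutorService ioPool, ExecutorService processPool) throws InterruptedException {
        List<ColumnPages> result = new ArrayList<>();
        for (String name : columns) {
            ColumnPages col = new ColumnPages(name);
            BlockingQueue<String> raw = new LinkedBlockingQueue<>();

            // Simulated IO task: never waits on the processing side (unbounded queue).
            ioPool.submit(() -> {
                for (int p = 0; p < pagesPerColumn; p++) {
                    raw.add(name + "#" + p);
                }
            });

            // Processing runs on its own pool, so it cannot starve the IO tasks.
            processPool.submit(() -> {
                try {
                    for (int p = 0; p < pagesPerColumn; p++) {
                        col.pages.add("decoded:" + raw.take());
                        if (p == 0) {
                            col.firstPageReady.countDown();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            result.add(col);
        }

        // Block only for the first page of each column; later pages keep arriving.
        for (ColumnPages col : result) {
            col.firstPageReady.await();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ioPool = Executors.newFixedThreadPool(2);
        ExecutorService processPool = Executors.newFixedThreadPool(2);
        try {
            for (ColumnPages col : readRowGroupAsync(List.of("a", "b", "c"), 3, ioPool, processPool)) {
                System.out.println(col.column + " -> " + col.pages.take());
            }
        } finally {
            ioPool.shutdown();
            processPool.shutdown();
        }
    }
}
```

   Because the simulated IO tasks only ever write to an unbounded queue, they 
always finish regardless of how busy the processing pool is, which is the 
property the two-pool split is meant to give.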
   
   > 
   > be good to think about vectored IO.
   
   I think I know how to integrate this PR with vectored IO, but this is 
only after a cursory look.
   
   > 
   > and yes, updating parquet dependencies would be good, hadoop 3.3.0 should 
be the baseline.
   
   Who can drive this (presumably) non-trivial change? I myself have no karma 
points :(
    
   > just sketched out my thoughts on this. I've played with some of this in my 
own branch. I think the next step would be for me to look at the benchmark code 
to make it targetable elsewhere.
   > 
   > 
https://docs.google.com/document/d/1y9oOSYbI6fFt547zcQJ0BD8VgvJWdyHBveaiCHzk79k/
   
   This is great. I now have much more context on where you are coming from 
(and going to)!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
