parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136528026
> Latency is the killer; in an HTTP request you want read enough but not discard data or break an http connection if the client suddenly does a seek() or readFully() somewhere else. file listings, existence checks etc.
>
> That'd be great. now, if you could also handle requesting different columns in parallel and processing them out of order.

I do. The Parquet file reader API that reads row groups reads all columns in sequence when in sync mode. In async mode, it fires off a task for every column, blocking only to read the first page of every column before returning. This part also uses a different thread pool from the IO tasks, so that IO tasks never have to wait because the thread pool has no available threads.

> be good to think about vectored IO.

I think I know how to integrate this PR with the vectored IO, but this is only after a cursory look.

> and yes, updating parquet dependencies would be good, hadoop 3.3.0 should be the baseline.

Who can drive this (presumably) non-trivial change? I myself have no karma points :(

> just sketched out my thoughts on this. I've played with some of this in my own branch. I think the next step would be for me to look at the benchmark code to make it targetable elsewhere.
>
> https://docs.google.com/document/d/1y9oOSYbI6fFt547zcQJ0BD8VgvJWdyHBveaiCHzk79k/

This is great. I now have much more context of where you are coming from (and going to)!
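
To make the async mode described above concrete, here is a minimal sketch of the shape of it, not the code in this PR. Every name in it (`AsyncColumnReadSketch`, `ColumnHandle`, `fetchPage`, `decodePage`, `readRowGroupAsync`) is hypothetical, and pages are plain strings standing in for real page data. It assumes one task per column on a processing pool, page reads on a separate IO pool, and a caller that blocks only until the first page of every column is available.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; none of these names come from parquet-mr.
class AsyncColumnReadSketch {

    // Per-column result: the first page, plus all pages once the column is done.
    static final class ColumnHandle {
        final CompletableFuture<String> firstPage = new CompletableFuture<>();
        final CompletableFuture<List<String>> allPages = new CompletableFuture<>();
    }

    // Stand-in for a blocking storage read of one page (runs on the IO pool).
    static String fetchPage(String pageId) {
        return "raw(" + pageId + ")";
    }

    // Stand-in for decoding a page (runs on the processing pool).
    static String decodePage(String rawPage) {
        return "decoded(" + rawPage + ")";
    }

    static Map<String, ColumnHandle> readRowGroupAsync(
            Map<String, List<String>> columns,
            ExecutorService ioPool,
            ExecutorService processPool) {
        Map<String, ColumnHandle> handles = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> column : columns.entrySet()) {
            ColumnHandle handle = new ColumnHandle();
            List<String> pageIds = column.getValue();
            handles.put(column.getKey(), handle);
            // One task per column, on the processing pool.
            processPool.execute(() -> {
                try {
                    // Queue every page read on the IO pool up front.
                    List<CompletableFuture<String>> raw = new ArrayList<>();
                    for (String pageId : pageIds) {
                        raw.add(CompletableFuture.supplyAsync(() -> fetchPage(pageId), ioPool));
                    }
                    List<String> decoded = new ArrayList<>();
                    for (CompletableFuture<String> page : raw) {
                        decoded.add(decodePage(page.join()));
                        if (decoded.size() == 1) {
                            handle.firstPage.complete(decoded.get(0));
                        }
                    }
                    // A column with no pages completes with null (no-op otherwise).
                    handle.firstPage.complete(null);
                    handle.allPages.complete(List.copyOf(decoded));
                } catch (RuntimeException e) {
                    handle.firstPage.completeExceptionally(e);
                    handle.allPages.completeExceptionally(e);
                }
            });
        }
        // Block only until the first page of every column is ready.
        for (ColumnHandle handle : handles.values()) {
            handle.firstPage.join();
        }
        return handles;
    }

    public static void main(String[] args) {
        ExecutorService ioPool = Executors.newFixedThreadPool(4);
        ExecutorService processPool = Executors.newFixedThreadPool(2);
        try {
            Map<String, List<String>> columns = new LinkedHashMap<>();
            columns.put("id", List.of("id-p0", "id-p1"));
            columns.put("name", List.of("name-p0", "name-p1", "name-p2"));
            Map<String, ColumnHandle> handles = readRowGroupAsync(columns, ioPool, processPool);
            handles.forEach((name, h) -> System.out.println(name + ": " + h.allPages.join()));
        } finally {
            ioPool.shutdown();
            processPool.shutdown();
        }
    }
}
```

The point of the two pools in this sketch: column tasks block on page reads, so if both kinds of work shared one bounded pool, the column tasks could occupy every thread while the reads they wait on sit in the queue. With a dedicated IO pool the reads always have threads to run on.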