parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136528026
> Latency is the killer; in an HTTP request you want to read enough but not
> discard data or break an HTTP connection if the client suddenly does a seek()
> or readFully() somewhere else. File listings, existence checks, etc.
>
> That'd be great. Now, if you could also handle requesting different
> columns in parallel and processing them out of order.
I do. In sync mode, the Parquet file reader API that reads row groups reads
all columns in sequence. In async mode, it fires off a task for every column,
blocking only to read the first page of each column before returning. This
part also uses a different thread pool from the IO tasks, so that IO tasks
never have to wait for a free thread in a shared pool.
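
To make the async behaviour described above concrete, here is a minimal sketch of the pattern. It is not the code in this PR: the class, the method names and the string "pages" are all hypothetical stand-ins. It only shows one task per column running on a processing pool, IO running on a separate pool, and the caller blocking just until the first page of every column is available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncColumnReadSketch {

    /** Hypothetical stand-in for one column chunk: a queue of decoded pages. */
    static final class ColumnPages {
        final String column;
        final BlockingQueue<String> pages = new LinkedBlockingQueue<>();
        final CompletableFuture<Void> firstPageReady = new CompletableFuture<>();

        ColumnPages(String column) {
            this.column = column;
        }
    }

    /**
     * Starts one processing task per column and returns once the first page
     * of every column is available. Remaining pages keep arriving afterwards.
     */
    static List<ColumnPages> readRowGroupAsync(List<String> columns, int pagesPerColumn,
            ExecutorService ioPool, ExecutorService processPool) {
        List<ColumnPages> result = new ArrayList<>();
        for (String column : columns) {
            ColumnPages cp = new ColumnPages(column);
            result.add(cp);
            // Per-column task runs on the processing pool, never on the IO pool.
            processPool.submit(() -> {
                try {
                    for (int i = 0; i < pagesPerColumn; i++) {
                        final int pageNo = i;
                        // Simulated IO, executed on the dedicated IO pool.
                        String raw = CompletableFuture
                                .supplyAsync(() -> column + "-raw-" + pageNo, ioPool)
                                .get();
                        // Simulated decode step on the processing thread.
                        cp.pages.add(raw.replace("raw", "page"));
                        if (i == 0) {
                            cp.firstPageReady.complete(null);
                        }
                    }
                    // Covers the zero-page case; no-op if already completed.
                    cp.firstPageReady.complete(null);
                } catch (Exception e) {
                    cp.firstPageReady.completeExceptionally(e);
                }
            });
        }
        // Block only until the first page of each column has been produced.
        for (ColumnPages cp : result) {
            cp.firstPageReady.join();
        }
        return result;
    }

    public static void main(String[] args) {
        ExecutorService ioPool = Executors.newFixedThreadPool(4);
        ExecutorService processPool = Executors.newFixedThreadPool(2);
        try {
            List<ColumnPages> cols =
                    readRowGroupAsync(List.of("a", "b", "c"), 3, ioPool, processPool);
            for (ColumnPages cp : cols) {
                System.out.println(cp.column + " first page: " + cp.pages.peek());
            }
        } finally {
            processPool.shutdown();
            ioPool.shutdown();
        }
    }
}
```

Because the IO tasks here never block on anything submitted to the processing pool, a busy processing pool cannot starve them of threads, which is the property the separate pools are meant to give.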
>
> be good to think about vectored IO.
I think I know how to integrate this PR with vectored IO, but that is
based only on a cursory look.
>
> and yes, updating parquet dependencies would be good, hadoop 3.3.0 should
> be the baseline.
Who can drive this (presumably) non-trivial change? I myself have no karma
points :(
> just sketched out my thoughts on this. I've played with some of this in my
> own branch. I think the next step would be for me to look at the benchmark code
> to make it targetable elsewhere.
>
> https://docs.google.com/document/d/1y9oOSYbI6fFt547zcQJ0BD8VgvJWdyHBveaiCHzk79k/
This is great. I now have much more context on where you are coming from
(and going to)!