[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

GitBox Thu, 17 Nov 2022 10:30:59 -0800


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1319040697


   @wgtmac thank you for looking at this. I don't have any more TODOs on this 
PR. 
   
   > * Adopt the incoming Hadoop vectored io api.
   
   This should be part of another PR. There is a draft PR (#999) open for 
this. Once that is merged, I can revisit the async I/O code and incorporate 
the vectored I/O API. 
   In other experiments I have seen that async I/O gives better results over 
slower networks. With faster network connections, as is the case when reading 
from S3 within an AWS environment, reading in parallel (as the vectored I/O 
API does) starts to give better results. 
   I believe that both should be available as options. 
   
   > * Benchmark against remote object stores from different cloud providers.
   
   The numbers I posted earlier were for reading from AWS/S3 over a 1 Gbps 
line. Reading from within AWS shows a smaller improvement. I don't have an 
account with other cloud providers, so any help here would be appreciated. 
   
   > IMO, switching `ioThreadPool` and `processThreadPool` to the reader 
instance level will make it more flexible.
   
   I've changed the thread pools so that they are not initialized by default, 
but I left them as static members. Ideally, there should be a single I/O thread 
pool that handles all the I/O for a process, with the size of the pool 
determined by the bandwidth of the underlying storage system. 
   Making them per instance is not an issue though. The calling code can decide 
to set the same thread pool for all instances and achieve the same result. 
   Let me update this. 
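   
   To illustrate the per-instance option, here is a minimal sketch of how 
calling code could still share one I/O pool across readers. `SketchReader`, 
`SharedPoolSketch`, `readAsync`, and the pool size of 4 are hypothetical names 
and values chosen for illustration; this is not the `ParquetFileReader` API or 
the code in this PR, and the "read" is simulated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for a reader; NOT the ParquetFileReader API.
class SketchReader {
    // Per-instance field: the caller decides which pool this reader uses.
    private final ExecutorService ioPool;

    SketchReader(ExecutorService ioPool) {
        this.ioPool = ioPool;
    }

    ExecutorService ioPool() {
        return ioPool;
    }

    // Simulates an async read: the task just reports `length` bytes as read.
    CompletableFuture<Integer> readAsync(int length) {
        return CompletableFuture.supplyAsync(() -> length, ioPool);
    }
}

class SharedPoolSketch {
    // Creates `readers` readers that all share one pool, issues one simulated
    // read per reader, and returns the total number of bytes reported.
    static int readAll(int readers, int bytesPerRead) {
        // Assumed size; in practice it would be tuned to storage bandwidth.
        ExecutorService shared = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<Integer>> pending = new ArrayList<>();
            for (int i = 0; i < readers; i++) {
                pending.add(new SketchReader(shared).readAsync(bytesPerRead));
            }
            int total = 0;
            for (CompletableFuture<Integer> f : pending) {
                total += f.join(); // wait for each simulated read
            }
            return total;
        } finally {
            shared.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll(3, 1024));
    }
}
```

   With this shape, passing the same `ExecutorService` to every reader gives 
the single-pool-per-process behaviour, while a caller that wants isolation can 
pass a different pool to each reader.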
   
   Also, any changes you want to make are fine with me, and the help is 
certainly appreciated!
   


