parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130270383

   
   > @parthchandra Would you mind having a look at my I/O performance 
optimization plan for ParquetMR? I think we should coordinate, since we have 
some ideas that might overlap what we touch.
   > 
https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing
   
   @theosib-amazon I read your document and went thru  #960. It looks like for 
the most part, #960 and this PR and complement each other.  The overlap I see 
is in the changes to `MultiBufferInputStream` where you have added the 
`readFully`, and `skipFully` APIs. The bulk of my changes for async IO are in a 
class derived from `MultiBufferInputStream` and the heart of the changes 
depends on overriding `MultiBufferInputStream.nextBuffer`. In 
`MultiBufferInputStream.nextBuffer` the assumption is that all the buffers have 
been read into. In `AsyncMultiBufferInputStream.nextBuffer` this assumption is 
removed and the call *blocks* only if the next required buffer has not been 
read into.
   Now, `skipFully` and `readFully` are potentially blocking calls because both 
call `nextBuffer` repeatedly if necessary. To gain  maximum pipelining, you 
want to make calls to skipFully and readFully such that you never block for too 
long (or at all) in the call. You will get this if you are skipping or reading 
less than the number of bytes in a single buffer. This is generally the case as 
decompression and decoding is at the page level and that is smaller than the 
size of a single buffer. However, for your optimizations, you should be aware 
of this behaviour.
   From what I see, I don't think there will be a conflict.  I'll pull in your 
PR and give it a deeper look. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to