parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130270383
> @parthchandra Would you mind having a look at my I/O performance optimization plan for ParquetMR? I think we should coordinate, since we have some ideas that might overlap what we touch.
> https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing

@theosib-amazon I read your document and went through #960. For the most part, #960 and this PR complement each other. The overlap I see is in the changes to `MultiBufferInputStream`, where you have added the `readFully` and `skipFully` APIs.

The bulk of my changes for async IO are in a class derived from `MultiBufferInputStream`, and the heart of the changes depends on overriding `MultiBufferInputStream.nextBuffer`. `MultiBufferInputStream.nextBuffer` assumes that all the buffers have already been read into. `AsyncMultiBufferInputStream.nextBuffer` removes this assumption, and the call *blocks* only if the next required buffer has not been read into yet.

Now, `skipFully` and `readFully` are potentially blocking calls, because both call `nextBuffer` repeatedly if necessary. To gain maximum pipelining, you want to call `skipFully` and `readFully` such that you never block for too long (or at all) in the call. You get this if you skip or read fewer bytes than a single buffer holds. This is generally the case, since decompression and decoding happen at the page level and a page is smaller than a single buffer. However, for your optimizations, you should be aware of this behaviour.

From what I see, I don't think there will be a conflict. I'll pull in your PR and give it a deeper look.
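
To make the blocking behaviour concrete, here is a minimal sketch of the idea. It is not the code in this PR: `AsyncBufferSketch`, its fields, and the use of `CompletableFuture` to stand for an in-flight read are all assumptions made for illustration. It only shows a `nextBuffer` that waits when the next buffer has not been filled, and a `readFully` that may therefore wait once per buffer boundary it crosses.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in, NOT the parquet-mr AsyncMultiBufferInputStream.
class AsyncBufferSketch {
    // Each future represents one buffer whose read may still be in flight.
    private final Queue<CompletableFuture<ByteBuffer>> pending = new ArrayDeque<>();
    private ByteBuffer current = ByteBuffer.allocate(0);
    int blockedCalls = 0; // nextBuffer calls that found the buffer not yet filled

    AsyncBufferSketch(List<CompletableFuture<ByteBuffer>> reads) {
        pending.addAll(reads);
    }

    // Waits only when the next buffer's read has not completed yet.
    private boolean nextBuffer() {
        CompletableFuture<ByteBuffer> next = pending.poll();
        if (next == null) {
            return false; // no more buffers
        }
        if (!next.isDone()) {
            blockedCalls++; // the caller is about to wait on IO
        }
        current = next.join();
        return true;
    }

    // Reads exactly len bytes; crosses into nextBuffer (and may wait)
    // every time the current buffer is exhausted.
    void readFully(byte[] dst, int off, int len) {
        while (len > 0) {
            if (!current.hasRemaining() && !nextBuffer()) {
                throw new IllegalStateException("unexpected end of stream");
            }
            int n = Math.min(len, current.remaining());
            current.get(dst, off, n);
            off += n;
            len -= n;
        }
    }
}
```

In this sketch, a `readFully` that stays within one buffer waits at most once, while a request spanning several buffers can wait at each boundary, which is why page-sized reads pipeline well.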