[ https://issues.apache.org/jira/browse/PARQUET-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541750#comment-17541750 ]

ASF GitHub Bot commented on PARQUET-2149:
-----------------------------------------

theosib-amazon commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136528207

   > > Is byte (and arrays and buffers of bytes) the only datatype you support?
   > > My PR is optimizing code paths that pull ints, longs, and other sizes out
   > > of the data buffers. Are those not necessary for any of the situations
   > > where you're using an async buffer?
   >
   > The input stream API is generally unaware of the datatypes of its contents,
   > so those are the only APIs I use. The other reason is that the
   > ParquetFileReader returns Pages, which basically contain metadata and
   > ByteBuffers of _compressed_ data. The decompression and decoding into types
   > come much later, in a downstream thread.
   >
   > For your PR, I don't think the AsyncMultibufferInputStream is ever going to
   > be in play in the paths you're optimizing. But just in case it is, your
   > type-aware methods will work as is, because AsyncMultibufferInputStream is
   > derived from MultiBufferInputStream and will inherit those methods.
   
   I'm still learning Parquet's structure. So it sounds to me like these buffer
   input streams are used twice: once to get data and decompress it, and then
   again to decode it into data structures. Is that correct? If so, it sounds
   like you're optimizing one layer of processing and I'm optimizing the next
   layer up, and it's kind of a coincidence that we're touching some of the same
   classes, just because code reuse has been possible here.
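
For context, a minimal sketch of the inheritance point quoted above, using
hypothetical stand-in classes (BaseBufferInputStream and AsyncBufferInputStream
are illustrative only, not the actual MultiBufferInputStream /
AsyncMultiBufferInputStream in parquet-mr): type-aware read methods defined
once on the base stream are inherited unchanged by an async subclass that only
changes how the underlying bytes are supplied.

    import java.io.EOFException;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    // Hypothetical stand-in for MultiBufferInputStream: a byte-oriented stream
    // with type-aware helpers layered on top of the byte-level primitive.
    class BaseBufferInputStream {
        protected final ByteBuffer buffer;

        BaseBufferInputStream(ByteBuffer buffer) {
            this.buffer = buffer.duplicate();
        }

        // Byte-level primitive; subclasses may change how these bytes arrive.
        public int read() throws IOException {
            if (!buffer.hasRemaining()) {
                throw new EOFException();
            }
            return buffer.get() & 0xFF;
        }

        // Type-aware helpers of the kind PR #968 optimizes. Defined here once
        // (little-endian assembly, for illustration), they are inherited by
        // every subclass.
        public int readInt() throws IOException {
            return read() | (read() << 8) | (read() << 16) | (read() << 24);
        }

        public long readLong() throws IOException {
            return (readInt() & 0xFFFFFFFFL) | ((long) readInt() << 32);
        }
    }

    // Hypothetical stand-in for AsyncMultiBufferInputStream: it only changes
    // where the bytes come from (e.g. a buffer filled by a background reader),
    // so the inherited readInt()/readLong() above keep working as-is.
    class AsyncBufferInputStream extends BaseBufferInputStream {
        AsyncBufferInputStream(ByteBuffer prefetchedByAsyncReader) {
            super(prefetchedByAsyncReader);
        }
    }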




> Implement async IO for Parquet file reader
> ------------------------------------------
>
>                 Key: PARQUET-2149
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2149
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Parth Chandra
>            Priority: Major
>
> ParquetFileReader's implementation has the following flow (simplified):
>   - For every column -> read from storage in 8MB blocks -> read all
>     uncompressed pages into an output queue
>   - From the output queues -> (downstream) decompression + decoding
> This flow is serialized, which means that downstream threads are blocked
> until the data has been read. Because a large part of the time is spent
> waiting for data from storage, threads are idle and CPU utilization is really
> low.
> There is no reason why this cannot be made asynchronous _and_ parallel. So,
> for column _i_: read one chunk at a time from storage until the end ->
> intermediate output queue -> read one uncompressed page at a time until the
> end -> output queue -> (downstream) decompression + decoding.
> Note that this can be made completely self-contained in ParquetFileReader, and
> downstream implementations like Iceberg and Spark will automatically be able
> to take advantage without code changes, as long as the ParquetFileReader APIs
> are not changed.
> In past work with async IO ([Drill async page
> reader|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java]),
> I have seen a 2x-3x improvement in reading speed for Parquet files.
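
As a rough illustration of the pipeline described in the issue above, here is a
minimal sketch using plain java.util.concurrent primitives (class and method
names are hypothetical, not the actual parquet-mr implementation): a background
task reads compressed pages for one column into a bounded queue, while the
downstream thread takes pages off the queue and decompresses/decodes them, so
IO and CPU work overlap instead of running serially.

    import java.nio.ByteBuffer;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical sketch of one column's read pipeline: a producer fills a
    // bounded queue with compressed pages, a consumer decompresses and decodes.
    class AsyncColumnPipeline {
        // Sentinel marking the end of the column chunk.
        private static final ByteBuffer END = ByteBuffer.allocate(0);

        private final BlockingQueue<ByteBuffer> pages = new ArrayBlockingQueue<>(16);
        private final ExecutorService ioPool = Executors.newSingleThreadExecutor();

        // Producer: reads compressed pages from storage on a background thread.
        void startReading(Iterable<ByteBuffer> compressedPagesFromStorage) {
            ioPool.submit(() -> {
                try {
                    for (ByteBuffer page : compressedPagesFromStorage) {
                        pages.put(page);      // blocks if the consumer falls behind
                    }
                    pages.put(END);           // signal end of column chunk
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Consumer: overlaps decompression/decoding with the ongoing IO above.
        void decodeAll() throws InterruptedException {
            for (ByteBuffer page = pages.take(); page != END; page = pages.take()) {
                decompressAndDecode(page);    // CPU-bound work proceeds while IO continues
            }
            ioPool.shutdown();
        }

        private void decompressAndDecode(ByteBuffer compressedPage) {
            // placeholder for the downstream decompression + decoding step
        }
    }

In the proposal above, one such pipeline would run per column, and since the
ParquetFileReader APIs stay the same, callers such as Iceberg and Spark would
benefit without code changes.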



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
