[jira] [Created] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader

Chao Sun (Jira) Mon, 16 Aug 2021 11:27:04 -0700

Chao Sun created SPARK-36529:
--------------------------------

             Summary: Decouple CPU with IO work in vectorized Parquet reader
                 Key: SPARK-36529
                 URL: https://issues.apache.org/jira/browse/SPARK-36529
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Chao Sun



Currently the vectorized Parquet reader appears to do almost everything 
sequentially:
1. read the row group through the file system API (possibly from remote 
storage such as S3)
2. allocate buffers and copy the row group bytes into them
3. decompress the data pages
4. in Spark, decode the requested columns one by one
5. move to the next row group and repeat from step 1

There is a lot of room to decouple the IO-intensive work from the 
CPU-intensive work. In addition, we could parallelize row group loading and 
column decoding to utilize all the cores available to a Spark task.
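
To make the idea concrete, here is a minimal conceptual sketch in Python. It 
is not Spark's or Parquet's reader code. All names (read_row_group, 
decode_column, scan) are hypothetical, the "store" is an in-memory list that 
stands in for a file's row groups, and zlib stands in for the page codec. It 
only illustrates the shape of the proposal: fetch the next row group while 
the current one is being decoded, and decode the columns of a row group in 
parallel.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib


def read_row_group(store, index):
    """Stand-in for the IO step: fetch the compressed column chunks of one row group."""
    return store[index]


def decode_column(compressed):
    """Stand-in for the CPU step: decompress and decode one column chunk."""
    return list(zlib.decompress(compressed))


def scan(store, decode_threads=4):
    """Yield decoded row groups, prefetching the next one during decoding."""
    if not store:
        return
    with ThreadPoolExecutor(max_workers=1) as io_pool, \
            ThreadPoolExecutor(max_workers=decode_threads) as cpu_pool:
        # Start the first read before entering the loop.
        pending = io_pool.submit(read_row_group, store, 0)
        for index in range(len(store)):
            row_group = pending.result()  # wait for the IO of this row group
            if index + 1 < len(store):
                # Kick off the next read so it overlaps with decoding below.
                pending = io_pool.submit(read_row_group, store, index + 1)
            # Decode all columns of the current row group in parallel.
            decoded = cpu_pool.map(decode_column, row_group.values())
            yield dict(zip(row_group, decoded))
```

Each element of "store" is assumed to be a dict mapping a column name to its 
compressed bytes. A real implementation would also have to bound the memory 
held by prefetched row groups and size the thread pools against the cores 
assigned to the task, which this sketch ignores.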




