Robert created ARROW-16530:
------------------------------

             Summary: Serial read operations on columns, even when parallel = true
                 Key: ARROW-16530
                 URL: https://issues.apache.org/jira/browse/ARROW-16530
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Go
    Affects Versions: 8.0.0
         Environment: Linux, golang 1.18, AMD64
            Reporter: Robert
             Fix For: 9.0.0


I have submitted a pull request with the changes.

In pqarrow, column readers for columns and struct members are obtained in a 
for loop that processes each column serially.  Getting a reader triggers a 
read request, so these reads are always issued serially, even when 
parallel = true.  The logic for getting the next batch of records works the 
same way: a for loop iterates through the columns and issues the reads one at 
a time.  The performance impact is especially large on high-latency storage 
such as cloud storage.

I'm working with complex parquet files with 500+ "root" columns where some 
fields are lists of structs.  Some of these structs have hundreds of columns.  
In my tests, 800+ read operations are issued to GCS serially, which makes the 
current state of pqarrow too slow to be usable.

The proposed change processes the columns concurrently when retrieving child 
readers and column readers, and issues the batch requests concurrently as well.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
