[
https://issues.apache.org/jira/browse/ARROW-16530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-16530:
-----------------------------------
Labels: pull-request-available (was: )
> Serial read operations on columns, even when parallel = true
> ------------------------------------------------------------
>
> Key: ARROW-16530
> URL: https://issues.apache.org/jira/browse/ARROW-16530
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Go
> Affects Versions: 8.0.0
> Environment: Linux, golang 1.18, AMD64
> Reporter: Robert
> Priority: Major
> Labels: pull-request-available
> Fix For: 9.0.0
>
> Original Estimate: 24h
> Time Spent: 10m
> Remaining Estimate: 23h 50m
>
> I have submitted a pull request with the changes.
> https://github.com/apache/arrow/pull/13120#issuecomment-1123982147
> In pqarrow, column readers for columns and struct members are obtained in a
> for loop that processes each column serially. Obtaining a reader triggers a
> read request, so these reads are always issued one at a time. The logic for
> fetching the next batch of records works the same way, iterating over the
> columns in a for loop. The performance impact is especially large on
> high-latency storage such as cloud object stores.
> I'm working with complex Parquet files that have 500+ "root" columns, where
> some fields are lists of structs. Some of these structs have hundreds of
> columns. In my tests, 800+ read operations are issued to GCS serially, which
> makes pqarrow in its current state too slow to be usable.
> The proposed change processes the columns concurrently when retrieving child
> readers and column readers, and issues batch requests concurrently.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)