[jira] [Updated] (ARROW-16530) Serial read operations on columns, even when parallel = true

ASF GitHub Bot (Jira) Wed, 11 May 2022 09:53:04 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-16530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated ARROW-16530:
-----------------------------------
    Labels: pull-request-available  (was: )

> Serial read operations on columns, even when parallel = true
> ------------------------------------------------------------
>
>                 Key: ARROW-16530
>                 URL: https://issues.apache.org/jira/browse/ARROW-16530
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Go
>    Affects Versions: 8.0.0
>         Environment: Linux, golang 1.18, AMD64
>            Reporter: Robert
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 9.0.0
>
>   Original Estimate: 24h
>          Time Spent: 10m
>  Remaining Estimate: 23h 50m
>
> I have submitted a pull request with the changes.
>  https://github.com/apache/arrow/pull/13120#issuecomment-1123982147
> In pqarrow, when getting column readers for columns and struct members, the 
> default behavior is a for loop that processes each column serially.  
> "Getting" a reader issues a read request, so these reads are always issued 
> serially.  The logic for getting the next batch of records works the same 
> way: a for loop iterates through the columns, so its reads are also issued 
> serially.  The performance impact is especially large on high-latency 
> storage such as cloud storage.
> I'm working with complex parquet files with 500+ "root" columns where some 
> fields are lists of structs.  Some of these structs have hundreds of columns.  
> In my tests, 800+ read operations are issued to GCS serially, which 
> makes the current state of pqarrow too slow to be usable.
> The revision processes the columns concurrently when retrieving child 
> readers and column readers, and issues batch requests concurrently.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
