[GitHub] [arrow] raceordie690 opened a new pull request, #13120: Added concurrency in key places that are always serial, regardless if parallel=true or not

GitBox Wed, 11 May 2022 09:16:45 -0700


raceordie690 opened a new pull request, #13120:
URL: https://github.com/apache/arrow/pull/13120


   Added concurrency to field readers. Even when parallel=true, there are 
times when the default behavior is serial, which causes very slow performance 
when dealing with many columns and structures with many columns.
   
   I'm working with very complex Parquet files that have 500+ columns and 
lists of structures with hundreds of columns. In the original code, getting the 
field readers is always done serially, regardless of whether parallel is true. 
The same is true when the readers retrieve the 'next batch' of records. I 
modified the code to perform concurrent 'read' operations in three places in 
two files. The cost of the serial reads is especially heavy on high-latency 
storage, e.g., cloud storage.
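
   To illustrate the general idea (this is not the code in this pull request), 
here is a minimal Python sketch of the difference between opening per-column 
readers one at a time and opening them concurrently. The function names, the 
simulated latency, and the worker count are all hypothetical and stand in for 
whatever the real reader does on each round trip to storage.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def open_column_reader(index, latency=0.02):
    """Hypothetical stand-in for building one column's reader.

    The sleep simulates a round trip to high-latency storage.
    """
    time.sleep(latency)
    return f"reader-{index}"


def open_readers_serial(indices):
    # One round trip after another: total time grows linearly with columns.
    return [open_column_reader(i) for i in indices]


def open_readers_concurrent(indices, max_workers=16):
    # Overlap the round trips; Executor.map() returns results in input order,
    # so the readers still line up with their column indices.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(open_column_reader, indices))
```

   Both functions return the same readers in the same order. In this sketch 
the concurrent version overlaps the waiting, so the benefit comes from the 
work being latency-bound rather than CPU-bound.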
   
   The original version required just over an hour to read 600+ columns from 
GCS. The revised version completes the same read in about 11 minutes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
