raceordie690 opened a new pull request, #13120: URL: https://github.com/apache/arrow/pull/13120
Added concurrency to field readers. Even when parallel=true, there are times when the default behavior is serial, which causes very slow performance when dealing with many columns and with structures that have many columns.

I'm working with very complex Parquet files that have 500+ columns and lists of structures with hundreds of columns. In the original code, the field readers are always obtained serially, regardless of whether parallel is true. The same is true when the readers retrieve the 'next batch' of records. I modified the code to perform concurrent 'read' operations in three places across two files.

The performance impact is especially heavy for files on high-latency storage, e.g., cloud storage. The original version required just over an hour to read 600+ columns from GCS. The revised version completes the same read in about 11 minutes.
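For illustration only, the difference between obtaining field readers serially and concurrently can be sketched as follows. This is not the code from the pull request: `open_field_reader`, `get_field_readers`, the `latency` value, and `max_workers` are hypothetical names and numbers chosen for the sketch, and the sleep merely stands in for a round trip to high-latency storage.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def open_field_reader(column_index, latency=0.01):
    """Hypothetical stand-in for building the reader of one column.

    The sleep simulates a single round trip to high-latency storage.
    """
    time.sleep(latency)
    return f"reader-{column_index}"


def get_field_readers(column_indices, parallel, max_workers=16):
    """Return one reader per column, in the same order as column_indices."""
    if not parallel:
        # Serial path: total time grows linearly with the number of columns.
        return [open_field_reader(i) for i in column_indices]

    # Concurrent path: the waits overlap, so total time is roughly
    # (number of columns / max_workers) round trips.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the column
        # order is preserved even though the work runs concurrently.
        return list(pool.map(open_field_reader, column_indices))
```

Both paths return the same readers in the same order; only the wall-clock time differs. Because the work is dominated by waiting on storage rather than by computation, overlapping the waits is where the speedup would come from under these assumptions.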
