[GitHub] [arrow] lidavidm commented on pull request #9620: ARROW-11843: [C++] Provide reentrant Parquet reader

GitBox Fri, 02 Apr 2021 13:37:04 -0700


lidavidm commented on pull request #9620:
URL: https://github.com/apache/arrow/pull/9620#issuecomment-812701516



   Okay, now that I've actually checked out the right branch…
   
   So long as pre-buffering is enabled, this PR in conjunction with ARROW-7001 
is either a big win (for S3) or has no effect (locally). Hence I'd argue we 
should just always enable pre-buffer. (The reason is that unless we either 
heavily refactor the Parquet reader or enable pre-buffer, the generator is 
effectively synchronous. I could go through and do the refactor, but 
pre-buffering gives us an 'easy' way to make the I/O async. If we want, we 
could change the read range cache to optionally be lazy, which would 
effectively be the same as refactoring the Parquet reader.)
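
To make the eager-versus-lazy distinction concrete, here is a toy model in Python using only the standard library. It is a sketch and not Arrow's actual read range cache: `ToyReadRangeCache`, `read_fn`, and the `lazy` flag are hypothetical names made up for illustration. The assumption is that an eager cache submits every requested byte range as soon as it is told about them (which is roughly what pre-buffering buys us), while a lazy one defers each read until a consumer first asks for that range.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyReadRangeCache:
    """Hypothetical stand-in for a read range cache (not Arrow's API)."""

    def __init__(self, read_fn, executor, lazy=False):
        self._read_fn = read_fn      # callable(offset, length) -> bytes
        self._executor = executor    # runs the (slow) reads in the background
        self._lazy = lazy
        self._futures = {}           # (offset, length) -> Future or None
        self._lock = threading.Lock()

    def cache(self, ranges):
        """Register byte ranges; an eager cache starts reading them now."""
        with self._lock:
            for offset, length in ranges:
                key = (offset, length)
                if self._lazy:
                    # Remember the range, but do not issue any I/O yet.
                    self._futures.setdefault(key, None)
                elif self._futures.get(key) is None:
                    # Eager: all reads are in flight before anyone asks.
                    self._futures[key] = self._executor.submit(
                        self._read_fn, offset, length)

    def read(self, offset, length):
        """Return a Future for a previously registered range."""
        key = (offset, length)
        with self._lock:
            future = self._futures[key]
            if future is None:
                # Lazy: the first request for a range triggers its read.
                future = self._executor.submit(self._read_fn, offset, length)
                self._futures[key] = future
        return future
```

In this model, a consumer that walks ranges one at a time overlaps its I/O only in the eager case, since every read is already in flight when it starts waiting. With the lazy variant the caller has to request several ranges before waiting on any of them to get the same overlap, which is the kind of change the reader refactor would involve.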
   
   Also, this changes the ParquetScanTask so that it manages intra-file 
concurrency internally. Hence, ParquetFileFragment only needs to generate one 
scan task now and doesn't have to do anything complicated around pre-buffering.
   
   ![Local Median Scan Time 
(1)](https://user-images.githubusercontent.com/327919/113451950-13871580-93d1-11eb-9150-d88917c5c66d.png)
   ![S3 Median Scan Time 
(1)](https://user-images.githubusercontent.com/327919/113451953-15e96f80-93d1-11eb-9bca-94360a5e94f4.png)
   
   


