[GitHub] [arrow] westonpace commented on pull request #14269: ARROW-17481: [C++][Python] Major performance improvements to CSV reading from S3

GitBox Thu, 08 Dec 2022 11:58:54 -0800


westonpace commented on PR #14269:
URL: https://github.com/apache/arrow/pull/14269#issuecomment-1343280843


   > However, I also noticed a potential problem with the current generator
   > usage in the CSV reader that needs to be investigated:
   > https://github.com/apache/arrow/issues/14792
   
   I think we're ok here because we aren't actually consuming that generator 
async-reentrantly.  The apply generate can be called re-entrantly (which would 
be a problem) but it appears this PR is using MakeSerialReadaheadGenerator.  I 
seem to recall we ran into a bug when we tried MakeReadaheadGenerator and I 
wonder if this was it.
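
To make the distinction concrete, here is a rough Python/asyncio model of the two readahead styles. It is only a sketch of the idea, not Arrow's C++ implementation: `CountingSource`, `serial_readahead` and `reentrant_readahead` are hypothetical names, and the assumption (as I understand the two helpers) is that the serial variant never asks the source for the next item until the previous pull has finished, while the other variant issues several pulls at once.

```python
import asyncio


class CountingSource:
    """Hypothetical async source that records how many pulls overlap."""

    def __init__(self, n):
        self._n = n
        self._next = 0
        self._in_flight = 0
        self.max_in_flight = 0

    async def pull(self):
        self._in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self._in_flight)
        await asyncio.sleep(0)  # stand-in for I/O latency
        i = self._next
        self._next += 1
        self._in_flight -= 1
        return i if i < self._n else None  # None marks end of stream


async def serial_readahead(source, max_readahead):
    """Buffer up to max_readahead items, with one outstanding pull at a time."""
    queue = asyncio.Queue(maxsize=max_readahead)

    async def pump():
        while True:
            item = await source.pull()  # wait for this pull before the next
            await queue.put(item)
            if item is None:
                return

    task = asyncio.create_task(pump())
    out = []
    while (item := await queue.get()) is not None:
        out.append(item)
    await task
    return out


async def reentrant_readahead(source, max_readahead):
    """Keep max_readahead pulls in flight, so the source is called re-entrantly."""
    pending = [asyncio.ensure_future(source.pull()) for _ in range(max_readahead)]
    out = []
    while pending:
        item = await pending.pop(0)
        if item is None:
            for fut in pending:  # let the remaining pulls finish
                await fut
            break
        out.append(item)
        pending.append(asyncio.ensure_future(source.pull()))
    return out


serial_src = CountingSource(5)
serial_items = asyncio.run(serial_readahead(serial_src, 3))

reentrant_src = CountingSource(5)
reentrant_items = asyncio.run(reentrant_readahead(reentrant_src, 3))

print(serial_items, serial_src.max_in_flight)
print(reentrant_items, reentrant_src.max_in_flight)
```

In this model both variants yield the same items, but only the second one ever has more than one pull outstanding on the source, which is the situation a generator that is not async-reentrant cannot tolerate.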
   
   So the current implementation will do parallel I/O, which is nice, and will 
interleave parsing and decoding.  However, it does not do parallel parsing.  I 
think we eventually want this, but first we will want to find a better way of 
handling #14792.  Given that the parallel I/O is already giving a nice benefit 
when reading from S3, perhaps the parallel parse could be left for a follow-up PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
