jorisvandenbossche commented on issue #40301: URL: https://github.com/apache/arrow/issues/40301#issuecomment-2045407817
> Is it a practical concern?

I think the original reported case (converting a tiny table of a few kilobytes to pandas can cause a spike of several hundred MBs in memory usage) is something people can certainly run into. And although it will often not be a concern (when working with smaller data, memory usage is typically not an issue, and when actually working with larger tables, where memory usage does become relevant, this overhead disappears), it is definitely surprising and can lead to confusion. So I think it is worth "fixing".

But the potential fix I was thinking of could also be something much simpler, like deciding via some heuristic not to do the conversion in parallel for smaller data. For the conversion in the other direction (pandas -> pyarrow), we actually already have such a heuristic (in Python): https://github.com/apache/arrow/blob/a6e577d031d20a1a7d3dd01536b9a77db5d1bff8/python/pyarrow/pandas_compat.py#L573-L581
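A minimal sketch of what such a heuristic could look like. The function name `should_use_threads` and the 1 MiB threshold are illustrative assumptions, not pyarrow internals or the heuristic linked above:

```python
# Hypothetical sketch: skip parallel conversion below a size threshold.
# The threshold value and helper are illustrative, not part of pyarrow.

SMALL_TABLE_THRESHOLD = 1 << 20  # 1 MiB; assumed cutoff, would need tuning


def should_use_threads(table_nbytes: int, num_columns: int) -> bool:
    """Heuristic: only parallelize when the table is large enough that
    per-thread overhead (and the associated memory spike) pays off."""
    if table_nbytes < SMALL_TABLE_THRESHOLD:
        # Tiny tables: single-threaded conversion avoids spawning a
        # thread pool for work that finishes in microseconds anyway.
        return False
    # Column-parallel conversion only helps with more than one column.
    return num_columns > 1
```

The result could then be fed into the existing `use_threads` argument, e.g. `table.to_pandas(use_threads=should_use_threads(table.nbytes, table.num_columns))`, leaving the default behavior unchanged for large tables.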
