Re: [I] [Python] Only convert in parallel for the ConsolidatedBlockCreator class for large data [arrow]

via GitHub Fri, 01 Mar 2024 14:23:29 -0800


anjakefala commented on issue #40301:
URL: https://github.com/apache/arrow/issues/40301#issuecomment-1974015365


   @felipecrv Is modifying `ThreadPool` better than an option where we use an 
approach similar to [the SplitBlockCreator 
class](https://github.com/apache/arrow/blob/a6e577d031d20a1a7d3dd01536b9a77db5d1bff8/python/pyarrow/src/arrow/python/arrow_to_pandas.cc#L2422)
 for tables under a certain size? That's more along the lines of what I was 
thinking of. 
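   
   To make the size-based idea concrete, here is a rough user-level sketch in 
Python. The actual change would live in the C++ conversion code linked above, 
so this only illustrates the heuristic. The helper names and the 64 MiB cutoff 
are made up for illustration; only `Table.nbytes` and 
`to_pandas(use_threads=...)` are existing pyarrow APIs.

```python
# Hypothetical cutoff: the thread does not propose a specific value.
PARALLEL_THRESHOLD_BYTES = 64 * 1024 * 1024


def use_threads_for(nbytes, threshold=PARALLEL_THRESHOLD_BYTES):
    """Return True when a table of `nbytes` bytes is large enough
    that a parallel conversion is assumed to be worthwhile."""
    return nbytes >= threshold


def to_pandas_sized(table, threshold=PARALLEL_THRESHOLD_BYTES):
    """Convert a pyarrow.Table to pandas, requesting a single-threaded
    conversion for tables below the (hypothetical) size threshold."""
    # Table.nbytes reports the bytes consumed by the table's buffers;
    # use_threads is an existing keyword of Table.to_pandas.
    return table.to_pandas(use_threads=use_threads_for(table.nbytes, threshold))
```

   A caller would use `to_pandas_sized(table)` in place of `table.to_pandas()`; 
the right threshold would have to be found by benchmarking.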
   
   However, if you think work-stealing would be the most robust solution, and 
one that other functions would also benefit from, I'd be game for approaching this. 
   
   I prefer the work-stealing approach because, ideally, we wouldn't require 
the user to know about the existence of an option to set. Folks might not know 
that the memory usage has to do with the spawning of individual threads. They 
might not even know why `to_pandas` spawns multiple threads. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
