holdenk commented on PR #54370:
URL: https://github.com/apache/spark/pull/54370#issuecomment-3946934856

   > 2. If I understand correctly, we stack data right? What does it mean for 
memory usage when the df is large?
   
   So eventually we'll have to collect the summaries back to the driver, but the 
summaries would all have to fit on the driver anyway, even if we're executing in 
a loop, since we store them in Python and display them. If someone had a silly 
number of columns this could maybe be an issue, but the old approach wouldn't 
work well either.
   
   There might be a bit of extra data during the final merge steps, when we're 
merging the aggregate objects, but if that ever became an issue we could look at 
treeReduce (though again, this would likely only happen in a degenerate case 
where the current implementation would also not behave well).
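   To illustrate the treeReduce idea, here's a minimal plain-Python sketch. The dict fields and helper names are hypothetical stand-ins for the real per-column aggregate objects; Spark's actual `RDD.treeReduce` does the equivalent merging across executors in multiple rounds rather than pulling every partition's aggregate to the driver at once:

   ```python
   def merge_summaries(a, b):
       # Merge two per-partition summary aggregates (count/min/max here
       # stand in for whatever the real aggregate objects track).
       return {
           "count": a["count"] + b["count"],
           "min": min(a["min"], b["min"]),
           "max": max(a["max"], b["max"]),
       }

   def tree_reduce(parts, merge):
       # Merge pairwise in rounds, so no single step holds all the
       # aggregates at once -- the shape of RDD.treeReduce, which
       # combines partial results in log(n) rounds on the executors.
       while len(parts) > 1:
           parts = [
               merge(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
               for i in range(0, len(parts), 2)
           ]
       return parts[0]

   # Hypothetical per-partition summaries for a single column.
   partition_summaries = [
       {"count": 3, "min": 1, "max": 9},
       {"count": 5, "min": -2, "max": 4},
       {"count": 2, "min": 0, "max": 7},
   ]
   result = tree_reduce(partition_summaries, merge_summaries)
   # result == {"count": 10, "min": -2, "max": 9}
   ```

   The payoff is only in the merge fan-in; the final summaries still have to fit on the driver either way, which is why it only matters in the degenerate case above.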


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

