akpatnam25 commented on pull request #33644:
URL: https://github.com/apache/spark/pull/33644#issuecomment-894693640


For example, below is a snippet of the kind of case we would want to safeguard against:
```scala
// Each value is a 180 MiB byte array, sized to stress driver memory.
def createValue() = new Array[Byte](180 * 1024 * 1024)
def createRdd(n: Int) = sc.parallelize(0 until n, n).map(_ => createValue())

val rdd = createRdd(13)
// Both seqOp and combOp just keep their first argument; depth is 4.
rdd.treeAggregate(createValue())(
  (v1: Array[Byte], v2: Array[Byte]) => v1,
  (v1: Array[Byte], v2: Array[Byte]) => v1,
  4)
```
For the above snippet, we are generating synthetic data to maximize the amount of partition data pulled into the driver: each partial aggregate is a 180 MiB array, and the final level of the tree aggregation fetches those partial results to the driver. This is just to illustrate what "bad behavior" might look like. With sufficient resources on the driver side this succeeds, but it fails when driver memory is not set high enough. The applications that have encountered this issue have, of course, dealt with it by increasing driver memory and similar tuning.
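
For a rough sense of scale (a back-of-the-envelope sketch using my own illustrative assumptions, not numbers from the PR), even a handful of these partial aggregates arriving on the driver at once adds up to gigabytes:

```scala
// Bytes the driver must hold if k partial aggregates from the snippet above
// are fetched at once, plus one copy for the zero value (illustrative model).
val bytesPerValue = 180L * 1024 * 1024   // one Array[Byte] from createValue()
def driverFootprint(k: Int): Long = (k + 1) * bytesPerValue
(1 to 13).foreach { k =>
  println(f"$k%2d partial(s) -> ${driverFootprint(k) / 1e9}%.2f GB on the driver")
}
```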
   
Given that, the overhead for most normally sized cases would just be the extra stage, roughly as sketched below.
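
To make the "extra stage" concrete, here is a minimal sketch of the general idea (my own illustration with a hypothetical helper name, not the exact implementation in this PR): combine the last level of partial aggregates in a single-partition shuffle stage on an executor, so the driver receives one final value instead of one partial result per partition.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper, for illustration only: combine the final level of
// partial aggregates on an executor rather than on the driver.
def finalCombineOnExecutor[U: ClassTag](partials: RDD[U], combOp: (U, U) => U): U =
  partials
    .repartition(1)                                   // the one extra (shuffle) stage
    .mapPartitions(it => Iterator(it.reduce(combOp))) // combine runs on an executor
    .collect()                                        // a single value reaches the driver
    .head
```

The cost is the shuffle of the partial aggregates into that single partition, which is the per-stage overhead mentioned above; for normally sized aggregates it is small compared to the driver-memory risk it removes.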

