akpatnam25 removed a comment on pull request #33644: URL: https://github.com/apache/spark/pull/33644#issuecomment-894693640
For example, below is a snippet of a case that we would want to safeguard against:
```scala
def createValue() = new Array[Byte](180 * 1024 * 1024)

def createRdd(n: Int) = sc.parallelize(0 until n, n).map(_ => createValue())

val rdd = createRdd(13)

rdd.treeAggregate(createValue())(
  (v1: Array[Byte], v2: Array[Byte]) => v1,
  (v1: Array[Byte], v2: Array[Byte]) => v1,
  4)
```
In the above snippet, we are generating synthetic data to maximize the partition results being pulled into the driver. This is just to illustrate what "bad behavior" might look like. With sufficient resources on the driver side this succeeds, but it fails when driver memory is not set high enough. The applications that have encountered this issue have so far dealt with it by increasing driver memory. Given that, the overhead for most normal-sized cases would just be the extra stage.
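As a rough back-of-the-envelope check (not part of the original comment, just to illustrate the scale involved): each element in the snippet is a ~180 MiB byte array, so if most of the 13 partition results reach the driver for the final reduce, the driver heap has to hold on the order of a couple of GiB of byte arrays. A minimal sizing sketch, with the caveat that the exact number of results pulled to the driver depends on how far treeAggregate's intermediate levels shrink the partition count:
```scala
// Hypothetical worst-case sizing sketch; treat the result as an upper bound rather than
// an exact figure, since depth = 4 lets some combining happen on executors first.
val elementSizeMiB = 180          // new Array[Byte](180 * 1024 * 1024) per element
val numPartitions  = 13
val worstCaseMiB   = elementSizeMiB * numPartitions  // ~2340 MiB if every partition result
                                                     // were pulled to the driver at once
// A driver launched with e.g. `--driver-memory 1g` would not fit this, while a larger
// heap (the workaround mentioned in the comment) does.
```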
