akpatnam25 commented on pull request #33644: URL: https://github.com/apache/spark/pull/33644#issuecomment-894694073
For example, below is a snippet of a case that we would want to safeguard against:

```scala
def createValue() = new Array[Byte](180 * 1024 * 1024)
def createRdd(n: Int) = sc.parallelize(0 until n, n).map(_ => createValue())
val rdd = createRdd(13)
rdd.treeAggregate(createValue())((v1: Array[Byte], v2: Array[Byte]) => v1, (v1: Array[Byte], v2: Array[Byte]) => v1, 4)
```

The snippet generates synthetic data so that as much as possible is pulled into the driver; it is just meant to illustrate what "bad behavior" might look like. With enough resources on the driver side it succeeds, but it fails when driver memory is not set high enough. Applications that have hit this issue have of course worked around it by increasing driver memory, etc.

Given that, the overhead for most normally sized cases would just be the extra stage. In the synthetic workload above there was no noticeable difference in time to complete, regardless of the number of partitions pulled into the last stage. That last reduce step would happen on the driver anyway, so the only real overhead is the shuffle and the scheduling.
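To make the "extra stage" idea concrete, here is a minimal sketch (not the actual change in this PR) of pushing that last combine into an executor-side stage so only a single value is collected to the driver. `executorSideFinalCombine` and `lastLevelPartials` are illustrative names, not the real API; inside `RDD.treeAggregate` this would be done on the remaining partially aggregated partitions.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper: combine the surviving partition results in one extra
// executor-side task, so the driver only collects the single final value
// instead of every remaining partition result.
def executorSideFinalCombine[U: ClassTag](partials: RDD[U])(combOp: (U, U) => U): U = {
  partials
    .repartition(1)               // the extra stage: shuffle the partials to a single task
    .mapPartitions { iter =>      // reduce them on an executor, not on the driver
      if (iter.hasNext) Iterator.single(iter.reduce(combOp)) else Iterator.empty
    }
    .collect()                    // at most one element reaches the driver
    .head
}

// Usage sketch against the example above (lastLevelPartials is assumed to be
// the RDD of partial aggregates left after the tree levels):
// val result = executorSideFinalCombine(lastLevelPartials)(
//   (v1: Array[Byte], v2: Array[Byte]) => v1)
```

The only cost added by this shape is the shuffle of the last few partials plus scheduling one more task, which matches what we observed in the synthetic run.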
