Is there any formula with which I could determine Shuffle Write before execution?
For example, in a Sort Merge Join, the stage that loads the first table reports a shuffle write of 429.2 MB. The table is 5.5 GB in HDFS with a 128 MB block size, so it is loaded in 45 tasks/partitions. How does this 5.5 GB turn into 429 MB of shuffle write? Could I determine that number before execution?

Environment:
#Workers = 2
#Cores/Worker = 4
#Assigned Memory/Worker = 512M
spark.shuffle.partitions = 200
spark.shuffle.compress = false
spark.shuffle.memoryFraction = 0.1
spark.shuffle.spill = true
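In case it helps frame the question, below is the rough back-of-envelope sketch I have been playing with. The "shuffle write ~= rows shuffled * average serialized record size" assumption is mine, not something I have verified in the Spark source, and the row count and per-record size below are placeholders I made up:

// Rough back-of-envelope sketch (my own assumption, not verified against the
// Spark source): shuffle write ~= rows shuffled * avg serialized record size
// after column pruning, with spark.shuffle.compress=false.
object ShuffleWriteEstimate {
  def main(args: Array[String]): Unit = {
    val tableSizeBytes = 5.5 * 1024 * 1024 * 1024      // 5.5 GB on HDFS
    val hdfsBlockBytes = 128L * 1024 * 1024            // 128 MB block size
    val numTasks = math.ceil(tableSizeBytes / hdfsBlockBytes).toInt
    println(s"input tasks/partitions ~= $numTasks")    // ~44, close to the 45 I see

    // Placeholder numbers: I would need the real row count and the real
    // serialized size of (join key + projected columns) per record.
    val rowsShuffled = 10000000L
    val avgSerializedRecordBytes = 45L
    val shuffleWriteBytes = rowsShuffled * avgSerializedRecordBytes
    println(f"estimated shuffle write ~= ${shuffleWriteBytes / 1e6}%.1f MB")
  }
}

If Spark uses a more principled formula for this internally, that is exactly what I am hoping to learn.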