Hi, I am consistently observing a driver OutOfMemoryError (Java heap space) during the shuffle phase, as indicated by the following log:
…………
16/05/14 21:57:03 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 36060250 bytes   <-- shuffle metadata size is big, and the full metadata will be sent to all workers?
16/05/14 21:57:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to <host1>:45757
16/05/14 21:57:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to <host2>:20300
16/05/14 21:57:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to <host3>:12389
16/05/14 21:57:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to <host4>:32197
…………
Exception in thread "dispatcher-event-loop-17"
Exception in thread "dispatcher-event-loop-3"
Exception in thread "dispatcher-event-loop-6"
16/05/14 21:59:04 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to <host5>:19639
Exception in thread "dispatcher-event-loop-21"
16/05/14 21:59:08 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to <host6>:58461
Exception in thread "dispatcher-event-loop-20"
Exception in thread "dispatcher-event-loop-13"
Exception in thread "dispatcher-event-loop-9"
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:103)   <-- shuffle metadata duplicated (?) when sending to each executor?
    at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:252)
    at org.apache.spark.rpc.netty.RemoteNettyRpcCallContext.send(NettyRpcCallContext.scala:64)
    at org.apache.spark.rpc.netty.NettyRpcCallContext.reply(NettyRpcCallContext.scala:32)
    at org.apache.spark.MapOutputTrackerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(MapOutputTracker.scala:62)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)

I enabled heap dumps on OOM and used jhat to analyze one. In the heap histogram, I found 146 byte-array objects with the exact same size of 36,060,293 bytes. I wonder whether these 146 large objects are in fact duplicates of the same shuffle metadata, *can experts please help me confirm whether that is true?*

(8G of driver memory was specified for the above run, which should be sufficient for the ~36M shuffle metadata, but probably not for 146 duplicates of it.)

thanks,
Renyi.
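P.S. A rough back-of-the-envelope check of my claim above (the constants are the values from the jhat heap histogram; the class name is just for this sketch):

```java
public class ShuffleMetadataEstimate {
    public static void main(String[] args) {
        // Values taken from the jhat heap histogram of the driver dump
        long statusBytes = 36_060_293L; // size of one serialized map-output status byte array
        long copies = 146L;             // identical byte arrays found in the heap

        long totalBytes = statusBytes * copies;
        double totalGiB = totalBytes / (1024.0 * 1024.0 * 1024.0);

        System.out.printf("%d copies x %d bytes = %d bytes (%.2f GiB)%n",
                copies, statusBytes, totalBytes, totalGiB);
        // -> 146 copies x 36060293 bytes = 5264802778 bytes (4.90 GiB)
        // Add JVM object overhead plus the transient ByteArrayOutputStream
        // buffers (Arrays.copyOf in the stack trace doubles the buffer while
        // serializing), and an 8 GB driver heap could plausibly be exhausted.
    }
}
```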