Recently we switched to EMR 4.5/Spark 1.6.1 and have since encountered a new failure scenario.
The primary symptom is that the cluster appears to be stalled: the job has not failed, but it will not proceed and has to be killed. One or more RpcTimeoutExceptions (see below) are usually found towards the bottom of the log. Another observation, made via the status API, is that the jvmGcTime value (see the JSON snippet below) for some tasks in a stage that never finished is abnormally large (in the millions). Even for identical queries/data, this problem doesn't always happen, and when the cluster doesn't hang I also do not see these large values for jvmGcTime.

Some questions:

1. What units is jvmGcTime in?

2. If it is in milliseconds, then that seems to indicate garbage collection times in the tens of minutes - far longer than the queries themselves usually take, and also longer than the default spark.rpc.askTimeout value of 120s. Could this be blocking/starving the promise and causing the failure?

3. We are doing almost entirely standard SQL: no UDFs and no use of the RDD API. I was under the impression that in Spark 1.6 pretty much everything in DataFrames uses the custom Tungsten serialization and unsafe buffers, even in the actual processing (so no actual unroll). Is this correct? What could be causing such GC churn?

===============================================================================

The exception found in the log (several stack trace lines removed for brevity):

16/04/29 17:43:29 ERROR ContextCleaner: Error cleaning broadcast 0
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
        ...
        at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        ...
        ... 12 more

16/04/29 17:43:29 WARN BlockManagerMaster: Failed to remove broadcast 0 with removeFromMaster = true - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
        ...
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
        at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242)
        ...
        ... 7 more

16/04/29 17:43:29 INFO ContextCleaner: Cleaned accumulator 2
16/04/29 19:33:23 INFO PackagesResourceConfig: Scanning for root resource and provider classes in the packages: org.apache.spark.status.api.v1
16/04/29 19:33:28 INFO ScanningResourceConfig: Root resource classes found: class org.apache.spark.status.api.v1.ApiRootResource
16/04/29 19:33:28 INFO ScanningResourceConfig: Provider classes found: class org.apache.spark.status.api.v1.JacksonMessageWriter
16/04/29 19:33:28 INFO WebApplicationImpl: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
16/04/29 19:34:09 WARN Errors: The following warnings have been detected with resource and/or provider classes:
WARNING: A sub-resource method, public scala.collection.Seq org.apache.spark.status.api.v1.OneStageResource.stageData(int), with URI template, "", is treated as a resource method

===============================================================================

A sample of the task data returned by the status API:

"1446406" : {
  "taskId" : 1446406,
  "index" : 593,
  "attempt" : 0,
  "launchTime" : "2016-04-27T11:16:25.005GMT",
  "executorId" : "364",
  "host" : "ip-10-0-0-41.ec2.internal",
  "taskLocality" : "RACK_LOCAL",
  "speculative" : false,
  "accumulatorUpdates" : [ ],
  "taskMetrics" : {
    "executorDeserializeTime" : 0,
    "executorRunTime" : 0,
    "resultSize" : 0,
    "jvmGcTime" : 1637765,
    "resultSerializationTime" : 0,
    "memoryBytesSpilled" : 0,
    "diskBytesSpilled" : 0,
    "inputMetrics" : {
      "bytesRead" : 40824341,
      "recordsRead" : 1130000
    },
    "shuffleWriteMetrics" : {
      "bytesWritten" : 0,
      "writeTime" : 0,
      "recordsWritten" : 0
    }
  }
},
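For context on question 2, here is the arithmetic I did on the sample above, as a throwaway Python snippet. It assumes the answer to question 1 is milliseconds - if jvmGcTime is in some other unit, the comparison below is meaningless:

```python
import json

# taskMetrics value copied from the status API sample above
task_metrics = json.loads('{"jvmGcTime": 1637765}')

ASK_TIMEOUT_S = 120  # default spark.rpc.askTimeout

# Assuming jvmGcTime is reported in milliseconds (question 1)
gc_s = task_metrics["jvmGcTime"] / 1000.0
print("GC time: %.1f minutes" % (gc_s / 60))
print("Exceeds spark.rpc.askTimeout: %s" % (gc_s > ASK_TIMEOUT_S))
```

Under that assumption this single task spent roughly 27 minutes in GC, i.e. more than a dozen times the 120s RPC timeout, which is what made me suspect the starvation scenario in question 2.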