Recently we switched to EMR 4.5/Spark 1.6.1 and have since encountered a new
failure scenario.
The primary symptom is that the cluster appears to be stalled: the job has not
failed but makes no further progress and has to be killed. One or more
RpcTimeoutExceptions (see below) are usually found towards the bottom of the
log. Another observation, made through the status API, is that the jvmGcTime
value (see the JSON snippet below) for some tasks in the stage that never
finished is abnormally large (in the millions). The problem does not occur
every time, even for identical queries/data, and when the cluster does not
hang I also do not see these large jvmGcTime values.
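For anyone who wants to reproduce the observation, a minimal sketch of pulling
that snapshot from the status API (the driver host, application id and stage
id are placeholders for whatever the UI reports, not our real values):

    import scala.io.Source

    object DumpStageTasks {
      def main(args: Array[String]): Unit = {
        // Spark 1.6 status API: the one-stage endpoint returns StageData,
        // including the per-task metrics map that the JSON sample further
        // down was taken from. Host, application id and stage id here are
        // placeholders.
        val url = "http://driver-host:4040/api/v1/applications/" +
          "application_1461754000000_0001/stages/17"
        val json = Source.fromURL(url).mkString
        // The response is one JSON document; grepping it for "jvmGcTime"
        // is enough to spot the outlier tasks.
        json.split("\n").filter(_.contains("jvmGcTime")).foreach(println)
      }
    }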
Some questions:
1. What unit is jvmGcTime in?
2. If it is milliseconds, that seems to indicate garbage collection times in
the tens of minutes - far longer than the queries themselves usually take, and
also longer than the default spark.rpc.askTimeout value of 120s. Could pauses
of that length be blocking/starving the RPC promise and causing the failure?
(A sketch of the settings I have in mind follows these questions.)
3. We are doing almost entirely standard SQL: no UDFs and no use of the RDD
API. I was under the impression that in Spark 1.6 pretty much everything in
the DataFrame path uses the custom Tungsten serialization and unsafe buffers,
even during the actual processing (so no actual unrolling). Is this correct?
If so, what could be causing such GC churn?
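For context, a minimal sketch of the configuration I have in mind when asking
question 2 (the timeout and GC-logging values are placeholders for diagnosis,
not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    object GcTimeoutDebug {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("gc-timeout-debug")
          // Raise the RPC ask timeout past the suspected GC pauses; 120s is
          // the default that the exceptions below complain about.
          .set("spark.rpc.askTimeout", "600s")
          // Surface executor GC pauses in the executor stdout logs.
          .set("spark.executor.extraJavaOptions",
            "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps")
        val sc = new SparkContext(conf)
        // ... run the SQL workload here ...
        sc.stop()
      }
    }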
===============================================================================
The exception found in the log (several stack trace lines removed for brevity):
16/04/29 17:43:29 ERROR ContextCleaner: Error cleaning broadcast 0
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    ...
    at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    ...
    ... 12 more
16/04/29 17:43:29 WARN BlockManagerMaster: Failed to remove broadcast 0 with removeFromMaster = true - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    ...
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
    at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242)
    ... 7 more
16/04/29 17:43:29 INFO ContextCleaner: Cleaned accumulator 2
16/04/29 19:33:23 INFO PackagesResourceConfig: Scanning for root resource and provider classes in the packages:
  org.apache.spark.status.api.v1
16/04/29 19:33:28 INFO ScanningResourceConfig: Root resource classes found:
  class org.apache.spark.status.api.v1.ApiRootResource
16/04/29 19:33:28 INFO ScanningResourceConfig: Provider classes found:
  class org.apache.spark.status.api.v1.JacksonMessageWriter
16/04/29 19:33:28 INFO WebApplicationImpl: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
16/04/29 19:34:09 WARN Errors: The following warnings have been detected with resource and/or provider classes:
  WARNING: A sub-resource method, public scala.collection.Seq org.apache.spark.status.api.v1.OneStageResource.stageData(int), with URI template, "", is treated as a resource method
===============================================================================
A sample task entry from the status API response (note the jvmGcTime value):
"1446406" : {
"taskId" : 1446406,
"index" : 593,
"attempt" : 0,
"launchTime" : "2016-04-27T11:16:25.005GMT",
"executorId" : "364",
"host" : "ip-10-0-0-41.ec2.internal",
"taskLocality" : "RACK_LOCAL",
"speculative" : false,
"accumulatorUpdates" : [ ],
"taskMetrics" : {
"executorDeserializeTime" : 0,
"executorRunTime" : 0,
"resultSize" : 0,
"jvmGcTime" : 1637765,
"resultSerializationTime" : 0,
"memoryBytesSpilled" : 0,
"diskBytesSpilled" : 0,
"inputMetrics" : {
"bytesRead" : 40824341,
"recordsRead" : 1130000
},
"shuffleWriteMetrics" : {
"bytesWritten" : 0,
"writeTime" : 0,
"recordsWritten" : 0
}
}
},