[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004733#comment-16004733 ]
Nick Hryhoriev edited comment on SPARK-5594 at 5/12/17 6:32 AM:
----------------------------------------------------------------
Hi,
I have the same issue, but on Spark 2.1. I can't apply the workaround of
removing the spark.cleaner.ttl configuration, because that setting was already
removed in SPARK-7689.
What is strange is that the issue appeared after the job had been running for
3 weeks, and it reproduces even after restarting the job.
Env: YARN on EMR 5.3, Spark 2.1. Checkpointing is used.
Stack trace:
{quote}
2017-05-10 13:50:55 ERROR TaskSetManager:70 - Task 1 in stage 2.0 failed 4 times; aborting job
2017-05-10 13:50:55 ERROR JobScheduler:91 - Error running job streaming job 1494423050000 ms.2
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 46, ip-10-191-116-244.eu-west-1.compute.internal, executor 1): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_141_piece0 of broadcast_141
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at com.playtech.bit.rtv.Converter$.toEventDetailsRow(HitsUpdater.scala:183)
    at com.playtech.bit.rtv.HitsUpdater$$anonfun$saveEventDetails$4$$anonfun$11.apply(HitsUpdater.scala:138)
    at com.playtech.bit.rtv.HitsUpdater$$anonfun$saveEventDetails$4$$anonfun$11.apply(HitsUpdater.scala:137)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at com.datastax.spark.connector.util.CountingIterator.next(CountingIterator.scala:16)
    at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
    at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
    at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:198)
    at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:175)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
    at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
    at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
    at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
    at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:175)
    at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
    at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
    at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
    at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_141_piece0 of broadcast_141
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:178)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:218)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
{quote}
P.S. The cluster was being downscaled during this time.
But I don't know whether that could cause any trouble, because this is a
streaming app and the node where the executor ran was not being downscaled.
P.P.S. I can confirm that downscaling the YARN cluster has no effect on this.
The problem appeared a few more times later, even without downscaling. The only
thing that helps is removing the checkpoint data from HDFS. For now I have
turned off checkpointing on the stream, and it looks like the issue no longer
appears.
But I still don't understand why this happens or what is wrong with checkpoints.
I really need checkpointing, because part of my business logic that handles
updates relies on checkpoint functionality.
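One possible explanation, offered as an assumption rather than a confirmed
diagnosis of this job: the Spark Streaming programming guide notes that
broadcast variables cannot be recovered from a checkpoint. If a Broadcast handle
is created once on the driver and captured in a DStream closure, a job restored
from checkpoint data may refer to a broadcast ID (such as broadcast_141) whose
blocks no longer exist. That would also fit the observation that deleting the
checkpoint data helps. The pattern the guide suggests is a lazily instantiated
singleton that is looked up inside foreachRDD or transform. A minimal sketch
follows; LookupTable, its contents, and the saveEventDetails signature are
hypothetical and not taken from the job above.
{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

// Hypothetical holder for the broadcast value; name and contents are illustrative.
object LookupTable {
  @volatile private var instance: Broadcast[Map[String, String]] = null

  // Creates the broadcast on first use, including the first batch after a
  // restart from checkpoint, instead of restoring a stale handle.
  def getInstance(sc: SparkContext): Broadcast[Map[String, String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(Map("key" -> "value"))
        }
      }
    }
    instance
  }
}

object BroadcastWithCheckpoint {
  // Hypothetical output step: look the broadcast up per batch on the driver
  // rather than capturing a Broadcast created outside the DStream graph.
  def saveEventDetails(events: DStream[String]): Unit = {
    events.foreachRDD { rdd =>
      val table = LookupTable.getInstance(rdd.sparkContext)
      rdd.map(e => table.value.getOrElse(e, e)).foreach(_ => ())
    }
  }
}
{code}
With this shape the checkpoint stores only the closure that calls getInstance,
so a restarted driver builds a fresh broadcast instead of reusing an old ID.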
> SparkException: Failed to get broadcast (TorrentBroadcast)
> ----------------------------------------------------------
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.0, 1.3.0
> Reporter: John Sandiford
> Priority: Critical
>
> I am uncertain whether this is a bug, however I am getting the error below
> when running on a cluster (works locally), and have no idea what is causing
> it, or where to look for more information.
> Any help is appreciated. Others appear to experience the same issue, but I
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable, all
> my other spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: Lost task 3.3 in stage 6.0 (TID 24, <removed>): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6
>     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
>     at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
>     at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>     at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>     at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
>     at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6
>     at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
>     at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
>     at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
>     at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
>     at scala.collection.immutable.List.foreach(List.scala:318)
>     at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
>     at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
>     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
>     ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}