[jira] [Comment Edited] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

Nick Hryhoriev (JIRA) Thu, 11 May 2017 23:33:42 -0700

    [ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004733#comment-16004733 ]


Nick Hryhoriev edited comment on SPARK-5594 at 5/12/17 6:32 AM:
----------------------------------------------------------------

Hi,
I have the same issue, but in Spark 2.1. I can't remove the spark.cleaner.ttl 
configuration, because it has already been removed (SPARK-7689).
What is strange is that the issue appeared after the job had been running for 
3 weeks, and it is reproduced even after restarting the job.
Env: YARN - EMR 5.3, Spark 2.1. Checkpointing is used.
Stack trace:
{quote}
2017-05-10 13:50:55 ERROR TaskSetManager:70 - Task 1 in stage 2.0 failed 4 times; aborting job
2017-05-10 13:50:55 ERROR JobScheduler:91 - Error running job streaming job 1494423050000 ms.2
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 46, ip-10-191-116-244.eu-west-1.compute.internal, executor 1): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_141_piece0 of broadcast_141
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
        at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
        at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at com.playtech.bit.rtv.Converter$.toEventDetailsRow(HitsUpdater.scala:183)
        at com.playtech.bit.rtv.HitsUpdater$$anonfun$saveEventDetails$4$$anonfun$11.apply(HitsUpdater.scala:138)
        at com.playtech.bit.rtv.HitsUpdater$$anonfun$saveEventDetails$4$$anonfun$11.apply(HitsUpdater.scala:137)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at com.datastax.spark.connector.util.CountingIterator.next(CountingIterator.scala:16)
        at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
        at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
        at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:198)
        at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:175)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
        at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:175)
        at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
        at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
        at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
        at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_141_piece0 of broadcast_141
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:178)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:218)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
{quote}

P.S. A cluster downscale was performed during this time.
But I don't know whether that could cause any trouble, because this is a 
Streaming app and the node where the executor ran was not being downscaled.

P.P.S. I can confirm that downscaling YARN has no effect on this.
The problem appeared again some time later even without it. The only thing 
that helps is removing the checkpoint data from HDFS. For now I have turned off 
checkpointing on the stream, and it looks like the issue no longer appears.
But I still do not understand why it happens or what is wrong with the 
checkpoints. I really need checkpointing, because the part of my business logic 
that handles updates depends on checkpoint functionality.
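
One possible explanation, not confirmed anywhere in this thread: the Spark 
Streaming programming guide states that broadcast variables cannot be recovered 
from a checkpoint, and recommends a lazily instantiated singleton so that the 
broadcast is re-created after the driver restarts. The stack trace above shows 
Converter$.toEventDetailsRow reading Broadcast.value, so the sketch below assumes 
the broadcast is created once at start-up and captured by the checkpointed 
stream. LookupTable, loadLookup and the Map payload are hypothetical names used 
only for illustration; they are not taken from the job in this comment.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical holder object: name and payload type are assumptions.
object LookupTable {

  // Not serialized into the checkpoint, so it is null again after a restart.
  @volatile private var instance: Broadcast[Map[String, String]] = null

  // Returns the existing broadcast, or creates it on first use
  // (double-checked locking, as in the Spark Streaming guide's example).
  def getInstance(sc: SparkContext): Broadcast[Map[String, String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(loadLookup())
        }
      }
    }
    instance
  }

  // Placeholder for whatever data the real job broadcasts (assumption).
  private def loadLookup(): Map[String, String] =
    Map("example-key" -> "example-value")
}

// Intended use, on the driver inside foreachRDD, so the broadcast is looked up
// per batch instead of being captured once when the stream is defined:
//
//   stream.foreachRDD { rdd =>
//     val lookup = LookupTable.getInstance(rdd.sparkContext)
//     rdd.map(key => lookup.value.getOrElse(key, key)).count()
//   }
```

Whether this applies here depends on how the job creates its broadcast, which 
the comment does not show.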



> SparkException: Failed to get broadcast (TorrentBroadcast)
> ----------------------------------------------------------
>
>                 Key: SPARK-5594
>                 URL: https://issues.apache.org/jira/browse/SPARK-5594
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: John Sandiford
>            Priority: Critical
>
> I am uncertain whether this is a bug, however I am getting the error below 
> when running on a cluster (works locally), and have no idea what is causing 
> it, or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable, all 
> my other spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: Lost task 3.3 in stage 6.0 (TID 24, <removed>): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6
>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
>         at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
>         at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>         at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>         at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
>         at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
>         at org.apache.spark.scheduler.Task.run(Task.scala:56)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
>         at scala.collection.immutable.List.foreach(List.scala:318)
>         at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
>         ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}



