[jira] [Resolved] (SPARK-7054) Spark jobs hang for ~15 mins when a node goes down

Patrick Wendell (JIRA) Wed, 22 Apr 2015 21:16:00 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-7054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Patrick Wendell resolved SPARK-7054.
------------------------------------
    Resolution: Invalid

Hey There,

Please send this to the Spark user mailing list to get feedback and help 
isolating the issue further. As it stands, the report is underspecified for a JIRA.

> Spark jobs hang for ~15 mins when a node goes down
> --------------------------------------------------
>
>                 Key: SPARK-7054
>                 URL: https://issues.apache.org/jira/browse/SPARK-7054
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.1
>         Environment: CentOS 6, Java 8
>            Reporter: Abhishek Choudhary
>            Priority: Blocker
>
> In a four-node cluster (on VMs) with 2 NameNodes, 2 DataNodes, and 10 
> executors (YARN 2.4), Spark jobs run in yarn-client mode. When a running 
> VM is shut down, the Spark job hangs for ~15 minutes.
> After ~45-50 seconds the driver learned of the lost block managers.
> From the logs:
> 2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
> org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
> BlockManagerId(9, ACUME-DN2, 40898) with no recent heart beats: 59674ms 
> exceeds 45000ms
> 2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
> org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
> BlockManagerId(5, ACUME-DN2, 37947) with no recent heart beats: 60044ms 
> exceeds 45000ms
> 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
> org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
> BlockManagerId(3, ACUME-DN2, 49808) with no recent heart beats: 54637ms 
> exceeds 45000ms
> 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
> org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
> BlockManagerId(1, ACUME-DN2, 44090) with no recent heart beats: 59049ms 
> exceeds 45000ms
> 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
> org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
> BlockManagerId(7, ACUME-DN2, 47267) with no recent heart beats: 56879ms 
> exceeds 45000ms
> After ~15 minutes the Spark driver received the executor-lost event and 
> rescheduled the failed tasks.
> From the logs:
> 2015-04-22 10:05:04,965 [sparkDriver-akka.actor.default-dispatcher-19] ERROR 
> org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Lost executor 
> 1 on ACUME-DN2: remote Akka client disassociated
> During these 15 minutes, all jobs were stuck waiting on the executors that 
> had been running on the shut-down VM.
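
To put a number on the delay described above, the two log timestamps can be 
compared directly. The following is only a minimal sketch: it assumes the log 
lines are joined back onto single lines, and the names LOG_LINES, 
parse_timestamp and first_timestamp are hypothetical helpers, not part of 
Spark. It measures the gap between the events and says nothing about its cause.

```python
import re
from datetime import datetime

# Two log excerpts copied from the report (email line wrapping removed).
LOG_LINES = [
    "2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN  "
    "org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager "
    "BlockManagerId(9, ACUME-DN2, 40898) with no recent heart beats: 59674ms "
    "exceeds 45000ms",
    "2015-04-22 10:05:04,965 [sparkDriver-akka.actor.default-dispatcher-19] ERROR "
    "org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Lost executor "
    "1 on ACUME-DN2: remote Akka client disassociated",
]

# Leading timestamp in the form "YYYY-MM-DD HH:MM:SS,mmm".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def parse_timestamp(line):
    """Return the leading timestamp of a log line as a datetime."""
    match = TIMESTAMP.match(line)
    if match is None:
        raise ValueError("no timestamp at start of line: %r" % line)
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def first_timestamp(lines, marker):
    """Return the timestamp of the first line containing `marker`."""
    for line in lines:
        if marker in line:
            return parse_timestamp(line)
    raise ValueError("marker not found: %r" % marker)


removed_at = first_timestamp(LOG_LINES, "Removing BlockManager")
lost_at = first_timestamp(LOG_LINES, "Lost executor")
gap = lost_at - removed_at

# Time between the first BlockManager removal and the executor-lost event.
print("gap in seconds:", gap.total_seconds())
```

For the two excerpts above the gap comes out at a little under 15 minutes, 
which matches the hang duration given in the report.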



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
