[jira] [Commented] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed

liupengcheng (JIRA) Tue, 24 Jul 2018 00:20:10 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553907#comment-16553907
 ]


liupengcheng commented on SPARK-24897:
--------------------------------------

Can some one verify the issue?

I checked the 2.3.1 branch, this problem seems not fixed.

> DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for 
> stage fetchFailed
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24897
>                 URL: https://issues.apache.org/jira/browse/SPARK-24897
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.1.0, 2.3.1
>            Reporter: liupengcheng
>            Priority: Major
>
> In Spark2.1, when a stage fetchfailed DAGScheduler will retry both this stage 
> and it's parent stage, however, when the parent stage is resubmitted and 
> start running, the mapstatuses can 
> still be invalidate by the stage's outstanding task due to fetchfailed.
> The stage's outstanding task might unregister the mapstatuses with new epoch, 
> thus causing 
> the parent stage repeated MetadataFetchFailed and finally failling the Job.
>  
>  
> {code:java}
> 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost 
> task 174.0 in stage 71.0 (TID 154127, <host>, executor 96): 
> FetchFailed(BlockManagerId(4945, <host>, 22409), shuffleId=24, mapId=667, 
> reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed 
> to connect to <host>/<ip>:22409
> 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking 
> ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a 
> fetch failure from ShuffleMapStage 69 ($plus$plus at 
> DeviceLocateMain.scala:236) 
> 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: 
> ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s 
> 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: 
> Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) 
> and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch 
> failure 
> 2018-07-23,01:52:36,004 WARN org.apache.spark.scheduler.TaskSetManager: Lost 
> task 120.0 in stage 71.0 (TID 154073, <host>, executor 286): 
> FetchFailed(BlockManagerId(4208, <host>, 22409), shuffleId=24, mapId=241, 
> reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed 
> to connect to <host>/<ip>:22409 
> 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: 
> Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) 
> and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch 
> failure 
> 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free 
> 26.7 MB) 
> 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_59_piece1 on <host>:52349 in memory (size: 4.0 MB, free: 
> 3.0 GB) 
> 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_61_piece6 on <host>:52349 in memory (size: 4.0 MB, free: 
> 3.0 GB) 
> 2018-07-23,01:52:36,079 INFO org.apache.spark.deploy.yarn.YarnAllocator: 
> Canceling requests for 0 executor containers 
> 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: 
> Expected to find pending requests, but found none.
>  2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_63_piece0 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 
> GB) 
> 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free 
> 30.7 MB) 
> 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_63_piece1 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free 
> 34.7 MB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_63_piece2 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free 
> 38.7 MB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_63_piece3 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free 
> 42.5 MB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_63_piece4 in memory on <host>:56780 (size: 3.8 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutputTracker: Broadcast 
> mapstatuses size = 384, actual size = 20784475 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutputTrackerMaster: Size of 
> output statuses for shuffle 17 is 384 bytes 
> 2018-07-23,01:52:36,133 INFO org.apache.spark.MapOutputTrackerMaster: Epoch 
> changed, not caching! 
> 2018-07-23,01:52:36,185 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_61_piece3 on <host>:52349 in memory (size: 4.0 MB, free: 
> 3.0 GB) 
> 2018-07-23,01:52:36,185 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_63_piece4 on <host>:56780 in memory (size: 3.8 MB, free: 
> 3.7 GB) 
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_63_piece2 on <host>:56780 in memory (size: 4.0 MB, free: 
> 3.7 GB) 
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_63_piece1 on <host>:56780 in memory (size: 4.0 MB, free: 
> 3.7 GB) 
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_63_piece0 on <host>:56780 in memory (size: 4.0 MB, free: 
> 3.7 GB) 
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_63_piece3 on <host>:56780 in memory (size: 4.0 MB, free: 
> 3.7 GB) 
> 2018-07-23,01:52:36,192 WARN org.apache.spark.scheduler.TaskSetManager: Lost 
> task 1.0 in stage 69.1 (TID 154955, <host>, executor 267): FetchFailed(null, 
> shuffleId=17, mapId=-1, reduceId=-1, message= 
> org.apache.spark.shuffle.MetadataFetchFailedException: java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_63_piece4 of 
> broadcast_63
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed

Reply via email to