[jira] [Commented] (SPARK-22903) AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber

liupengcheng (JIRA) Tue, 26 Dec 2017 04:45:30 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303811#comment-16303811
 ]


liupengcheng commented on SPARK-22903:
--------------------------------------

After reviewing the code, I found that the bug is caused by a wrong 
attemptNumber. When a stage is retried, a new TaskSetManager is created, which 
resets taskSetManager.taskAttempts. As a result, the TaskDescriptions generated 
for the resubmitted stage carry attemptNumbers that start from zero again, even 
though the corresponding taskAttemptPath may already have been created by the 
previous failed stage attempt, for example 
'/<outputpath>/_temporary/0/_temporary/attempt_201712211720_0026_r_000014_0/part-r-00014.snappy.parquet'.
 

So I think it is necessary to fix this attemptNumber and use an accumulated 
attemptNumber that takes failed stage attempts into account.
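
To make the collision concrete, here is a minimal Python model of the behaviour 
described above. It is only a sketch, not Spark's code: TaskSetModel, 
attempt_path and combined_attempt_number are hypothetical names, and the 
bit-shift encoding is just one possible way to build an attemptNumber that 
accounts for earlier stage attempts.

```python
class TaskSetModel:
    """Toy stand-in for a TaskSetManager: one instance per stage attempt."""

    def __init__(self, stage_attempt):
        self.stage_attempt = stage_attempt
        self.task_attempts = {}  # partition -> number of launches so far

    def next_attempt_number(self, partition):
        # The counter lives in the instance, so a new stage attempt
        # starts every partition from zero again.
        n = self.task_attempts.get(partition, 0)
        self.task_attempts[partition] = n + 1
        return n


def attempt_path(output, job_tracker_id, job_id, partition, attempt_number):
    # Mirrors the layout of the path shown in the log above.
    attempt_id = (
        f"attempt_{job_tracker_id}_{job_id:04d}_r_{partition:06d}_{attempt_number}"
    )
    return (
        f"{output}/_temporary/0/_temporary/{attempt_id}"
        f"/part-r-{partition:05d}.snappy.parquet"
    )


def combined_attempt_number(stage_attempt, task_attempt):
    # Assumed encoding: stage attempt in the high bits, task attempt in the low bits.
    return (stage_attempt << 16) | task_attempt


first_run = TaskSetModel(stage_attempt=0)
retry_run = TaskSetModel(stage_attempt=1)

n_first = first_run.next_attempt_number(14)
n_retry = retry_run.next_attempt_number(14)

# Current behaviour: both stage attempts use attemptNumber 0 and the same path.
old_first = attempt_path("/<outputpath>", "201712211720", 26, 14, n_first)
old_retry = attempt_path("/<outputpath>", "201712211720", 26, 14, n_retry)

# Sketch of the proposed behaviour: the stage attempt is folded into the number.
new_first = attempt_path(
    "/<outputpath>", "201712211720", 26, 14,
    combined_attempt_number(first_run.stage_attempt, n_first),
)
new_retry = attempt_path(
    "/<outputpath>", "201712211720", 26, 14,
    combined_attempt_number(retry_run.stage_attempt, n_retry),
)

print(old_first == old_retry)  # the two stage attempts collide
print(new_first == new_retry)  # the combined number keeps them apart
```

In this model the retried stage asks HDFS to create exactly the file that the 
task from the failed stage attempt may still hold a lease on, which matches 
the AlreadyBeingCreatedException in the report.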

> AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber
> -------------------------------------------------------------------------
>
>                 Key: SPARK-22903
>                 URL: https://issues.apache.org/jira/browse/SPARK-22903
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.3.0
>         Environment: Spark2.1.0 + yarn
>            Reporter: liupengcheng
>              Labels: core
>
> We submitted a Spark 2.1.0 job. When a MetadataFetchFailed exception 
> occurred, the stage was retried, but an AlreadyBeingCreatedException was 
> thrown, which finally caused the job to fail.
> {noformat}
> 2017-12-21,21:30:58,406 WARN org.apache.spark.scheduler.TaskSetManager: Lost 
> task 13.0 in stage 7.1 (TID 18990, <host>, executor 326): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
>  Failed to create file 
> [/<outputpath>/_temporary/0/_temporary/attempt_201712211720_0026_r_000014_0/part-r-00014.snappy.parquet]
>  for [DFSClient_NONMAPREDUCE_-1477691024_103] for client [10.136.42.10], 
> because this file is already being created by 
> [DFSClient_NONMAPREDUCE_940892524_103] on [10.118.21.26]
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2672)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2388)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2317)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2270)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:604)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:374)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1806)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1477)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1408)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy21.create(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:301)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy22.create(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1779)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1773)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1698)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:433)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:429)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:444)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:373)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:928)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:806)
>         at 
> org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
>         at 
> org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
>         at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
>         at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>         at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1108)
>         at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>         at org.apache.spark.scheduler.Task.run(Task.scala:90)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:241)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
