[
https://issues.apache.org/jira/browse/SPARK-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303811#comment-16303811
]
liupengcheng commented on SPARK-22903:
--------------------------------------
After reviewing the code, I finally find that the bug is caused by the wrong
attemptNumber. When a stage is retrying, a new taskSetManager is created and it
will result int the reset of taskSetManager.taskAttempts. So the generated
TaskDescription of resubmited stage will contain wrong attemptNumber starting
from zero, however, the taskAttemptPath may be created in the previous failed
stage, like
'/<outputpath>/_temporary/0/_temporary/attempt_201712211720_0026_r_000014_0/part-r-00014.snappy.parquet'.
So, I think it's necessary to fix this attemptNumber and use a accumated
attemptNumber taking into acount failure stages.
> AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber
> -------------------------------------------------------------------------
>
> Key: SPARK-22903
> URL: https://issues.apache.org/jira/browse/SPARK-22903
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.0
> Environment: Spark2.1.0 + yarn
> Reporter: liupengcheng
> Labels: core
>
> We submit a Spark2.1.0 spark job, however, when MetadataFetchFailed
> exception ocurred, stage is being retried, but a AlreadyBeingCreatedException
> is thrown and finally caused job failure.
> {noformat}
> 2017-12-21,21:30:58,406 WARN org.apache.spark.scheduler.TaskSetManager: Lost
> task 13.0 in stage 7.1 (TID 18990, <host>, executor 326):
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
> Failed to create file
> [/<outputpath>/_temporary/0/_temporary/attempt_201712211720_0026_r_000014_0/part-r-00014.snappy.parquet]
> for [DFSClient_NONMAPREDUCE_-1477691024_103] for client [10.136.42.10],
> because this file is already being created by
> [DFSClient_NONMAPREDUCE_940892524_103] on [10.118.21.26]
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2672)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2388)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2317)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2270)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:604)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:374)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1806)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> at org.apache.hadoop.ipc.Client.call(Client.java:1477)
> at org.apache.hadoop.ipc.Client.call(Client.java:1408)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy21.create(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:301)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy22.create(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1779)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1773)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1698)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:433)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:429)
> at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:444)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:373)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:928)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:806)
> at
> org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
> at
> org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
> at
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
> at
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1108)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:90)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:241)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]