[ https://issues.apache.org/jira/browse/SPARK-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307633#comment-16307633 ]
liupengcheng commented on SPARK-22903:
--------------------------------------
[~imranr] I agree with you that an incomplete cleanup can cause the exception,
but the ResultTask's save path is a user-specified argument that gets serialized
into the function closure; I don't think we can, or should, extract it from the
closure in order to perform the cleanup.
What's more, there is another case that triggers the problem: Spark currently
keeps the tasks of a fetch-failed stage running rather than killing them, so if
such a task has not finished when the same task of the retried stage is
submitted, the two tasks conflict and raise the exception.
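A minimal sketch of the conflict described above (not Spark code; the path layout is inferred from the attempt_&lt;timestamp&gt;_&lt;job&gt;_r_&lt;partition&gt;_&lt;attempt&gt; naming visible in the log below, and "/outputpath" stands in for the elided output directory). Because the retried stage resubmits the same partition with the same attemptNumber, both the still-running old task and the new task resolve to the same temporary file, and the second HDFS create() fails with AlreadyBeingCreatedException:

```python
def attempt_temp_path(output_dir, job_ts, job_id, partition, attempt_number):
    """Mirror the _temporary attempt-path naming seen in the exception message."""
    attempt_id = f"attempt_{job_ts}_{job_id:04d}_r_{partition:06d}_{attempt_number}"
    return (f"{output_dir}/_temporary/0/_temporary/{attempt_id}/"
            f"part-r-{partition:05d}.snappy.parquet")

# Task 13.0 of the failed stage attempt 7.0, still running after the fetch failure...
old_task_path = attempt_temp_path("/outputpath", "201712211720", 26, 14, 0)
# ...and the resubmitted task in stage attempt 7.1, with the same attemptNumber 0.
new_task_path = attempt_temp_path("/outputpath", "201712211720", 26, 14, 0)

# Identical paths: the NameNode still holds a lease for the first client,
# so the second create() raises AlreadyBeingCreatedException.
assert old_task_path == new_task_path
print(new_task_path)
```

If the retried stage used a distinct attemptNumber (the fix this issue's title points at), the two paths would differ and the create() calls would not collide.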
> AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber
> -------------------------------------------------------------------------
>
> Key: SPARK-22903
> URL: https://issues.apache.org/jira/browse/SPARK-22903
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.0
> Environment: Spark2.1.0 + yarn
> Reporter: liupengcheng
> Labels: core
>
> We submitted a Spark 2.1.0 job; when a MetadataFetchFailed exception
> occurred, the stage was retried, but an AlreadyBeingCreatedException
> was thrown and ultimately caused the job to fail.
> {noformat}
> 2017-12-21,21:30:58,406 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 13.0 in stage 7.1 (TID 18990, <host>, executor 326): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/<outputpath>/_temporary/0/_temporary/attempt_201712211720_0026_r_000014_0/part-r-00014.snappy.parquet] for [DFSClient_NONMAPREDUCE_-1477691024_103] for client [10.136.42.10], because this file is already being created by [DFSClient_NONMAPREDUCE_940892524_103] on [10.118.21.26]
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2672)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2388)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2317)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2270)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:604)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:374)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1806)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> at org.apache.hadoop.ipc.Client.call(Client.java:1477)
> at org.apache.hadoop.ipc.Client.call(Client.java:1408)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy21.create(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:301)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy22.create(Unknown Source)
> at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1779)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1773)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1698)
> at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:433)
> at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:429)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:444)
> at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:373)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:928)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:806)
> at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
> at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1108)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:90)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:241)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]