[ https://issues.apache.org/jira/browse/SPARK-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307633#comment-16307633 ]
liupengcheng commented on SPARK-22903:
--------------------------------------
[~imranr] I agree with you that an incomplete cleanup can cause the exception,
but the ResultTask's save path is a user-specified argument that gets serialized
into the function closure; I don't think we can, or should, extract it from the
closure in order to perform the cleanup.
What's more, there is another case that triggers the problem: Spark currently
keeps the tasks of a fetch-failed stage running rather than killing them, so if
such a task has not finished when the same task of the retried stage is
submitted, the two tasks conflict and raise the exception.
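A minimal sketch of the conflict described above (not Spark code; the path layout is inferred from the attempt_&lt;timestamp&gt;_&lt;job&gt;_r_&lt;partition&gt;_&lt;attempt&gt; naming visible in the log below, and "/outputpath" stands in for the elided output directory). Because the retried stage resubmits the same partition with the same attemptNumber, both the still-running old task and the new task resolve to the same temporary file, and the second HDFS create() fails with AlreadyBeingCreatedException:

```python
def attempt_temp_path(output_dir, job_ts, job_id, partition, attempt_number):
    """Mirror the _temporary attempt-path naming seen in the exception message."""
    attempt_id = f"attempt_{job_ts}_{job_id:04d}_r_{partition:06d}_{attempt_number}"
    return (f"{output_dir}/_temporary/0/_temporary/{attempt_id}/"
            f"part-r-{partition:05d}.snappy.parquet")

# Task 13.0 of the failed stage attempt 7.0, still running after the fetch failure...
old_task_path = attempt_temp_path("/outputpath", "201712211720", 26, 14, 0)
# ...and the resubmitted task in stage attempt 7.1, with the same attemptNumber 0.
new_task_path = attempt_temp_path("/outputpath", "201712211720", 26, 14, 0)

# Identical paths: the NameNode still holds a lease for the first client,
# so the second create() raises AlreadyBeingCreatedException.
assert old_task_path == new_task_path
print(new_task_path)
```

If the retried stage used a distinct attemptNumber (the fix this issue's title points at), the two paths would differ and the create() calls would not collide.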
> AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber
> -------------------------------------------------------------------------
>
> Key: SPARK-22903
> URL: https://issues.apache.org/jira/browse/SPARK-22903
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.0
> Environment: Spark2.1.0 + yarn
> Reporter: liupengcheng
> Labels: core
>
> We submitted a Spark 2.1.0 job; when a MetadataFetchFailed exception
> occurred, the stage was retried, but an AlreadyBeingCreatedException
> was thrown and ultimately caused the job to fail.
> {noformat}
> 2017-12-21,21:30:58,406 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 13.0 in stage 7.1 (TID 18990, <host>, executor 326): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/<outputpath>/_temporary/0/_temporary/attempt_201712211720_0026_r_000014_0/part-r-00014.snappy.parquet] for [DFSClient_NONMAPREDUCE_-1477691024_103] for client [10.136.42.10], because this file is already being created by [DFSClient_NONMAPREDUCE_940892524_103] on [10.118.21.26]
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2672)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2388)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2317)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2270)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:604)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:374)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1806)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> at org.apache.hadoop.ipc.Client.call(Client.java:1477)
> at org.apache.hadoop.ipc.Client.call(Client.java:1408)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy21.create(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:301)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy22.create(Unknown Source)
> at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1779)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1773)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1698)
> at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:433)
> at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:429)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:444)
> at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:373)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:928)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:806)
> at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
> at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1108)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:90)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:241)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]