[
https://issues.apache.org/jira/browse/SPARK-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307663#comment-16307663
]
liupengcheng commented on SPARK-22903:
--------------------------------------
[~imranr] I think keeping the tasks of the fetch-failed stage running is OK,
because the failure is always caused by the network, and not all tasks would
fail.
Waiting for all tasks of the previous stage to complete may be a good way to
solve the problem, but if an executorLost happens, or some other exception
occurs in a task, the output attempt path may not be cleaned up, and the stage
retry will still encounter this exception (see the cleanup sketch below).
Maybe the fix plan needs more discussion?
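For illustration only, here is a minimal Scala sketch of the kind of defensive
cleanup discussed above. It is not Spark's actual committer code: the helper
name and path layout are hypothetical (modelled on the _temporary layout in
the log below), while FileSystem.exists/delete are standard HDFS client calls.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: before a retried task opens its writer, remove any
// temporary attempt directory left behind by a lost executor, so the retry
// does not collide with a half-written file whose HDFS lease is still held.
def cleanStaleAttemptPath(fs: FileSystem, outputDir: String, attemptId: String): Unit = {
  val stale = new Path(s"$outputDir/_temporary/0/_temporary/$attemptId")
  if (fs.exists(stale)) {
    fs.delete(stale, true) // recursive: also drops any partial part files
  }
}

// e.g. cleanStaleAttemptPath(FileSystem.get(new Configuration()),
//        "/<outputpath>", "attempt_201712211720_0026_r_000014_0")
{code}

Note that running such a cleanup at the start of the retried attempt, rather
than at task completion, would sidestep the executorLost case mentioned above,
where a completion-time hook is never reached.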
> AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber
> -------------------------------------------------------------------------
>
> Key: SPARK-22903
> URL: https://issues.apache.org/jira/browse/SPARK-22903
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.0
> Environment: Spark 2.1.0 + YARN
> Reporter: liupengcheng
> Labels: core
>
> We submitted a Spark 2.1.0 job; however, when a MetadataFetchFailed
> exception occurred and the stage was being retried, an
> AlreadyBeingCreatedException was thrown, which finally caused the job to fail.
> {noformat}
> 2017-12-21,21:30:58,406 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 13.0 in stage 7.1 (TID 18990, <host>, executor 326): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/<outputpath>/_temporary/0/_temporary/attempt_201712211720_0026_r_000014_0/part-r-00014.snappy.parquet] for [DFSClient_NONMAPREDUCE_-1477691024_103] for client [10.136.42.10], because this file is already being created by [DFSClient_NONMAPREDUCE_940892524_103] on [10.118.21.26]
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2672)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2388)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2317)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2270)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:604)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:374)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1806)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> at org.apache.hadoop.ipc.Client.call(Client.java:1477)
> at org.apache.hadoop.ipc.Client.call(Client.java:1408)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy21.create(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:301)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy22.create(Unknown Source)
> at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1779)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1773)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1698)
> at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:433)
> at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:429)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:444)
> at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:373)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:928)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:806)
> at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
> at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1108)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:90)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:241)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
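To make the failure mode in the log concrete: Hadoop derives a task's
temporary output file from its TaskAttemptID, so if a stage retry reuses the
same attemptNumber (the "wrong attemptNumber" in the summary), the retried
task targets exactly the file the old attempt is still holding a lease on.
A minimal sketch, assuming only the standard org.apache.hadoop.mapreduce ID
classes; the attemptPath helper is hypothetical, mirroring the path layout in
the log, and is not Spark's actual code.

{code:scala}
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}

// Hypothetical helper mirroring the _temporary layout seen in the log above.
def attemptPath(outputDir: String, attempt: TaskAttemptID): String =
  s"$outputDir/_temporary/0/_temporary/$attempt/part-r-00014.snappy.parquet"

val jobId  = new JobID("201712211720", 26)
val taskId = new TaskID(jobId, TaskType.REDUCE, 14)

// Task attempt from stage 7.0, possibly still writing after a FetchFailed...
val original = new TaskAttemptID(taskId, 0)
// ...and the retried task in stage 7.1: if attemptNumber is not bumped,
// the TaskAttemptID, and therefore the output path, is identical.
val retried  = new TaskAttemptID(taskId, 0)

assert(attemptPath("/<outputpath>", original) ==
       attemptPath("/<outputpath>", retried))
// Both resolve to
// /<outputpath>/_temporary/0/_temporary/attempt_201712211720_0026_r_000014_0/part-r-00014.snappy.parquet
// so the second create() fails with AlreadyBeingCreatedException while the
// first writer still holds the HDFS lease.
{code}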