[
https://issues.apache.org/jira/browse/HADOOP-17216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199465#comment-17199465
]
Steve Loughran commented on HADOOP-17216:
-----------------------------------------
Changed the title to say where the remaining issues are. Looking at the code,
they've got a list of files and should just retry getFileStatus() calls on
FileNotFoundException. Nothing we can do at the Hadoop layer.
FWIW, this will go away if you turn S3Guard on, as it will return the file
status from the DynamoDB table, and know to spin on the GET when there's a
record in DDB.
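A minimal sketch of the client-side retry the comment suggests: wrap the
getFileStatus()-style call and retry when S3's cached 404 surfaces as a
FileNotFoundException. The {{retryOnFnfe}} helper, its attempt count, and its
linear backoff are illustrative assumptions, not part of any Hadoop or Delta
Lake API:
{code:java}
```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;

public class FnfeRetry {

    // Retry a call that may throw FileNotFoundException because of S3's
    // negative (404) caching; back off linearly between attempts.
    // Hypothetical helper for illustration only.
    public static <T> T retryOnFnfe(Callable<T> call, int attempts, long sleepMillis)
            throws Exception {
        FileNotFoundException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return call.call();
            } catch (FileNotFoundException e) {
                last = e;                        // likely a cached 404
                Thread.sleep(sleepMillis * (i + 1));
            }
        }
        throw last;                              // still missing: give up
    }

    public static void main(String[] args) throws Exception {
        // Simulate a getFileStatus() that 404s twice before the object
        // becomes visible on the third probe.
        final int[] calls = {0};
        String status = retryOnFnfe(() -> {
            if (++calls[0] < 3) {
                throw new FileNotFoundException("No such file or directory");
            }
            return "FileStatus{len=42}";
        }, 5, 1L);
        System.out.println(status + " after " + calls[0] + " calls");
    }
}
```
{code}
In the Delta Lake commit path this would wrap each per-file probe in
DelayedCommitProtocol.commitTask rather than the filesystem call itself.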
> Delta Lake task commit encountering S3 cached 404/FileNotFoundException
> -----------------------------------------------------------------------
>
> Key: HADOOP-17216
> URL: https://issues.apache.org/jira/browse/HADOOP-17216
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 3.1.2
> Environment: hadoop = "3.1.2"
> hadoop-aws = "3.1.2"
> spark = "2.4.5"
> spark-on-k8s-operator = "v1beta2-1.1.2-2.4.5"
> deployed into an AWS EKS Kubernetes cluster. Version information below:
> Server Version: version.Info{Major:"1", Minor:"16+",
> GitVersion:"v1.16.8-eks-e16311",
> GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean",
> BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc",
> Platform:"linux/amd64"}
> Reporter: Cheng Wei
> Priority: Major
> Fix For: 3.3.0
>
>
> Hi,
> When using Spark streaming with Delta Lake, I occasionally get the following
> exception, roughly 1 run in 100. Thanks.
> {code:java}
> Caused by: java.io.FileNotFoundException: No such file or directory: s3a://[pathToFolder]/date=2020-07-29/part-00005-046af631-7198-422c-8cc8-8d3adfb4413e.c000.snappy.parquet
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
>   at org.apache.spark.sql.delta.files.DelayedCommitProtocol$$anonfun$8.apply(DelayedCommitProtocol.scala:141)
>   at org.apache.spark.sql.delta.files.DelayedCommitProtocol$$anonfun$8.apply(DelayedCommitProtocol.scala:139)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.delta.files.DelayedCommitProtocol.commitTask(DelayedCommitProtocol.scala:139)
>   at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:78)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)