[
https://issues.apache.org/jira/browse/SPARK-47008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846014#comment-17846014
]
Steve Loughran commented on SPARK-47008:
----------------------------------------
yes, that looks like it. Real PITA, this feature, though apparently it's there to
let you know that you have outstanding uploads to purge - no lifecycle rules,
see.
FWIW you can explicitly create the real situation with a touch command under a
__magic path:
{code}
hadoop fs -touch s3a://stevel--usw1-az2--x-s3/cli/__magic/__base/d/file.txt
{code}
this creates an incomplete upload destined for /cli/d/file.txt
> Spark to support S3 Express One Zone Storage
> --------------------------------------------
>
> Key: SPARK-47008
> URL: https://issues.apache.org/jira/browse/SPARK-47008
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Steve Loughran
> Priority: Major
>
> Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.
> Most of this is transparent. However, one aspect which can surface as an
> issue is that these stores report prefixes in a listing when there are
> pending uploads, *even when there are no files underneath*.
> This leads to a situation where a listStatus of a path returns a list of file
> status entries which appears to contain one or more directories - but a
> listStatus on any of those paths raises a FileNotFoundException: there is
> nothing there.
> HADOOP-18996 handles this in all of the hadoop code, including FileInputFormat.
> A filesystem can now be probed for inconsistent directory listings through
> {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}
> If true, then treewalking code SHOULD NOT report a failure if, when walking
> into a subdirectory, a list/getFileStatus on that directory raises a
> FileNotFoundException.
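The guarded-recursion rule above can be sketched against an in-memory stand-in for a filesystem; `ResilientWalk`, its `DIRS` map, and the sample paths are hypothetical illustration, not Hadoop or Spark APIs:

```java
import java.io.FileNotFoundException;
import java.util.*;

// Sketch of an FNFE-tolerant treewalk. DIRS maps a directory path to its
// child names (names ending in "/" are subdirectories). A subdirectory
// that appears in its parent's listing but has no listing of its own
// simulates the phantom prefix an S3 Express pending upload leaves behind.
public class ResilientWalk {
    static final Map<String, List<String>> DIRS = new HashMap<>();
    static {
        DIRS.put("/", Arrays.asList("a.txt", "real/", "phantom/"));
        DIRS.put("/real/", Arrays.asList("b.txt"));
        // "/phantom/" deliberately has no listing of its own
    }

    static List<String> listOrThrow(String dir) throws FileNotFoundException {
        List<String> kids = DIRS.get(dir);
        if (kids == null) throw new FileNotFoundException(dir);
        return kids;
    }

    static List<String> listLeafFiles(String dir) throws FileNotFoundException {
        List<String> leaves = new ArrayList<>();
        for (String child : listOrThrow(dir)) {
            String path = dir + child;
            if (child.endsWith("/")) {
                try {
                    leaves.addAll(listLeafFiles(path));
                } catch (FileNotFoundException e) {
                    // The directory vanished (or never really existed)
                    // between listing and recursion: skip it rather than
                    // failing the whole walk.
                }
            } else {
                leaves.add(path);
            }
        }
        return leaves;
    }

    public static void main(String[] args) throws FileNotFoundException {
        System.out.println(listLeafFiles("/")); // [/a.txt, /real/b.txt]
    }
}
```

A walk that let the FileNotFoundException propagate would fail on "/phantom/" even though every real file underneath the root is still readable.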
> Although most of this is handled in the hadoop code, there are some places
> where treewalking is done inside Spark. These need to be identified and made
> resilient to failure on the recurse down the tree:
> * SparkHadoopUtil list methods, especially listLeafStatuses used by
> OrcFileOperator
> * org.apache.spark.util.Utils#fetchHcfsFile
> {{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist
> here, or the logic can be replicated. Using the hadoop implementation would
> be better from a maintenance perspective.
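If the logic is replicated rather than delegated to the Hadoop helper, it reduces to a capability check. This is a sketch only: the boolean `listingIsInconsistent` stands in for the result of the {{hasPathCapability}} probe, and the method shape is illustrative, not the actual Hadoop signature:

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Sketch of the HADOOP-18996 pattern: swallow a FileNotFoundException
// raised while scanning a listed directory only when the store admits
// its listings may be inconsistent; otherwise rethrow it.
public class MissingDirPolicy {
    static void maybeIgnoreMissingDirectory(boolean listingIsInconsistent,
                                            FileNotFoundException e)
            throws IOException {
        if (!listingIsInconsistent) {
            // on a consistent store, a listed-but-missing path is an error
            throw e;
        }
        // inconsistent store (e.g. S3 Express with pending uploads):
        // tolerate the missing directory and let the treewalk continue
    }

    public static void main(String[] args) throws IOException {
        FileNotFoundException fnfe = new FileNotFoundException("d");
        maybeIgnoreMissingDirectory(true, fnfe); // tolerated, no exception
        boolean rethrown = false;
        try {
            maybeIgnoreMissingDirectory(false, fnfe);
        } catch (FileNotFoundException ex) {
            rethrown = true;
        }
        System.out.println("rethrown=" + rethrown); // rethrown=true
    }
}
```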
--
This message was sent by Atlassian Jira
(v8.20.10#820010)