[
https://issues.apache.org/jira/browse/SPARK-47008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844968#comment-17844968
]
Leo Timofeyev commented on SPARK-47008:
---------------------------------------
Hey [[email protected]]
What do you think about something like this variant?
{code:java}
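import java.io.{FileNotFoundException, IOException}
import scala.util.{Failure, Success, Try}
import org.apache.hadoop.fs.{FileStatus, FileSystem}
import org.apache.spark.util.ArrayImplicits._

def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = {
  def recurse(status: FileStatus): Seq[FileStatus] = {
    // Does this store declare that listings may contain directories that do not exist?
    // A failed probe is treated as "no".
    val fsHasPathCapability = try {
      fs.hasPathCapability(status.getPath, SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
    } catch {
      case _: IOException => false
    }
    Try(fs.listStatus(status.getPath)) match {
      // Phantom directory on an inconsistent store: treat it as empty.
      case Failure(_: FileNotFoundException) if fsHasPathCapability =>
        Seq.empty[FileStatus]
      // Any other failure still propagates.
      case Failure(e) => throw e
      case Success(sr) =>
        val (directories, leaves) = sr.partition(_.isDirectory)
        (leaves ++ directories.flatMap(f => listLeafStatuses(fs, f))).toImmutableArraySeq
    }
  }

  // Assumption: entry point as in the existing method; it was not part of the posted snippet.
  if (baseStatus.isDirectory) recurse(baseStatus) else Seq(baseStatus)
}
{code}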
> Spark to support S3 Express One Zone Storage
> --------------------------------------------
>
> Key: SPARK-47008
> URL: https://issues.apache.org/jira/browse/SPARK-47008
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Steve Loughran
> Priority: Major
>
> Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.
> Most of this is transparent. However, one aspect which can surface as an
> issue is that these stores report prefixes in a listing when there are
> pending uploads, *even when there are no files underneath*.
> This leads to a situation where a listStatus of a path returns a list of file
> status entries which appears to contain one or more directories, but a
> listStatus on one of those directories raises a FileNotFoundException: there
> is nothing there.
> HADOOP-18996 handles this throughout the Hadoop code, including FileInputFormat.
> A filesystem can now be probed for inconsistent directory listings through
> {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}
> If true, then treewalking code SHOULD NOT report a failure if, when walking
> into a subdirectory, a list/getFileStatus on that directory raises a
> FileNotFoundException.
> Although most of this is handled in the Hadoop code, there are some places
> where treewalking is done inside Spark. These need to be identified and made
> resilient to failure when recursing down the tree:
> * SparkHadoopUtil list methods, especially listLeafStatuses used by
> OrcFileOperator
> * org.apache.spark.util.Utils#fetchHcfsFile
> {{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist
> here, or the logic can be replicated. Using the Hadoop implementation would
> be better from a maintenance perspective.
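
For comparison, here is a minimal sketch of the Hadoop-helper route mentioned in the issue description. It assumes Hadoop 3.4.0 or later on the classpath, and that {{FileUtil.maybeIgnoreMissingDirectory}} takes {{(FileSystem, Path, FileNotFoundException)}}, returns normally when the filesystem declares the inconsistent-listing capability, and rethrows the exception otherwise. The object name {{LeafListing}} is hypothetical. This is an illustration only, not the actual Spark change.

```scala
import java.io.FileNotFoundException

import org.apache.hadoop.fs.{FileStatus, FileSystem, FileUtil}

// Hypothetical holder object, used here only to keep the sketch self-contained.
object LeafListing {

  /** Recursively collects the files under `base`. A directory that turns out
    * not to exist is skipped only if the Hadoop helper decides it may be ignored. */
  def listLeafStatuses(fs: FileSystem, base: FileStatus): Seq[FileStatus] = {
    if (!base.isDirectory) {
      Seq(base)
    } else {
      val children: Array[FileStatus] =
        try {
          fs.listStatus(base.getPath)
        } catch {
          case e: FileNotFoundException =>
            // Assumed behavior: rethrows `e` unless the store reports
            // "fs.capability.directory.listing.inconsistent" for this path.
            FileUtil.maybeIgnoreMissingDirectory(fs, base.getPath, e)
            Array.empty[FileStatus]
        }
      val (directories, leaves) = children.partition(_.isDirectory)
      leaves.toSeq ++ directories.toSeq.flatMap(d => listLeafStatuses(fs, d))
    }
  }
}
```

Delegating the decision to the helper keeps the capability probe and its error handling in one place in Hadoop, which is the maintenance argument made in the description; the trade-off is a hard dependency on a Hadoop version that ships the helper.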