[
https://issues.apache.org/jira/browse/SPARK-47008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846014#comment-17846014
]
Steve Loughran commented on SPARK-47008:
----------------------------------------
yes, that looks like it. Real PITA, this feature, though apparently it's there to
let you know that you have outstanding uploads to purge - no lifecycle rules,
see.
FWIW you can explicitly create the real situation with a touch command under a
__magic path:
{code}
hadoop fs -touch s3a://stevel--usw1-az2--x-s3/cli/__magic/__base/d/file.txt
{code}
this creates an incomplete upload destined for /cli/d/file.txt
> Spark to support S3 Express One Zone Storage
> --------------------------------------------
>
> Key: SPARK-47008
> URL: https://issues.apache.org/jira/browse/SPARK-47008
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Steve Loughran
> Priority: Major
>
> Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.
> Most of this is transparent. However, one aspect which can surface as an
> issue is that these stores report prefixes in a listing when there are
> pending uploads, *even when there are no files underneath*.
> This leads to a situation where a listStatus of a path returns a list of file
> status entries which appears to contain one or more directories - but a
> listStatus on any of those paths raises a FileNotFoundException: there is
> nothing there.
> HADOOP-18996 handles this in all of the hadoop code, including FileInputFormat.
> A filesystem can now be probed for inconsistent directory listings through
> {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}
> If true, then treewalking code SHOULD NOT report a failure if, when walking
> into a subdirectory, a list/getFileStatus on that directory raises a
> FileNotFoundException.
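The guarded-recursion rule above can be sketched against an in-memory stand-in for a filesystem; `ResilientWalk`, its `DIRS` map, and the sample paths are hypothetical illustration, not Hadoop or Spark APIs:

```java
import java.io.FileNotFoundException;
import java.util.*;

// Sketch of an FNFE-tolerant treewalk. DIRS maps a directory path to its
// child names (names ending in "/" are subdirectories). A subdirectory
// that appears in its parent's listing but has no listing of its own
// simulates the phantom prefix an S3 Express pending upload leaves behind.
public class ResilientWalk {
    static final Map<String, List<String>> DIRS = new HashMap<>();
    static {
        DIRS.put("/", Arrays.asList("a.txt", "real/", "phantom/"));
        DIRS.put("/real/", Arrays.asList("b.txt"));
        // "/phantom/" deliberately has no listing of its own
    }

    static List<String> listOrThrow(String dir) throws FileNotFoundException {
        List<String> kids = DIRS.get(dir);
        if (kids == null) throw new FileNotFoundException(dir);
        return kids;
    }

    static List<String> listLeafFiles(String dir) throws FileNotFoundException {
        List<String> leaves = new ArrayList<>();
        for (String child : listOrThrow(dir)) {
            String path = dir + child;
            if (child.endsWith("/")) {
                try {
                    leaves.addAll(listLeafFiles(path));
                } catch (FileNotFoundException e) {
                    // The directory vanished (or never really existed)
                    // between listing and recursion: skip it rather than
                    // failing the whole walk.
                }
            } else {
                leaves.add(path);
            }
        }
        return leaves;
    }

    public static void main(String[] args) throws FileNotFoundException {
        System.out.println(listLeafFiles("/")); // [/a.txt, /real/b.txt]
    }
}
```

A walk that let the FileNotFoundException propagate would fail on "/phantom/" even though every real file underneath the root is still readable.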
> Although most of this is handled in the hadoop code, there are some places
> where treewalking is done inside Spark. These need to be identified and made
> resilient to failure on the recurse down the tree:
> * SparkHadoopUtil list methods, especially listLeafStatuses used by
> OrcFileOperator
> * org.apache.spark.util.Utils#fetchHcfsFile
> {{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist
> here, or the logic can be replicated. Using the hadoop implementation would
> be better from a maintenance perspective.
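If the logic is replicated rather than delegated to the Hadoop helper, it reduces to a capability check. This is a sketch only: the boolean `listingIsInconsistent` stands in for the result of the {{hasPathCapability}} probe, and the method shape is illustrative, not the actual Hadoop signature:

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Sketch of the HADOOP-18996 pattern: swallow a FileNotFoundException
// raised while scanning a listed directory only when the store admits
// its listings may be inconsistent; otherwise rethrow it.
public class MissingDirPolicy {
    static void maybeIgnoreMissingDirectory(boolean listingIsInconsistent,
                                            FileNotFoundException e)
            throws IOException {
        if (!listingIsInconsistent) {
            // on a consistent store, a listed-but-missing path is an error
            throw e;
        }
        // inconsistent store (e.g. S3 Express with pending uploads):
        // tolerate the missing directory and let the treewalk continue
    }

    public static void main(String[] args) throws IOException {
        FileNotFoundException fnfe = new FileNotFoundException("d");
        maybeIgnoreMissingDirectory(true, fnfe); // tolerated, no exception
        boolean rethrown = false;
        try {
            maybeIgnoreMissingDirectory(false, fnfe);
        } catch (FileNotFoundException ex) {
            rethrown = true;
        }
        System.out.println("rethrown=" + rethrown); // rethrown=true
    }
}
```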
--
This message was sent by Atlassian Jira
(v8.20.10#820010)