[
https://issues.apache.org/jira/browse/HADOOP-18793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17740546#comment-17740546
]
Harunobu Daikoku commented on HADOOP-18793:
-------------------------------------------
I'm guessing the ${UUID} directory is preserved on purpose, as the commit job ID
provided by Spark was historically not guaranteed to be unique, as described in
the documentation.
Deleting one's own staging directory might therefore impact other in-flight
commit jobs that share the same job ID (timestamp).
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#Job_commit_fails_java.io.FileNotFoundException_.E2.80.9CFile_hdfs:.2F.2F....2Fstaging-uploads.2F_temporary.2F0_does_not_exist.E2.80.9D
{quote}
Spark generates job IDs for its committers using the current timestamp, and if
two jobs/stages are started in the same second, they will have the same job ID.
{quote}
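The quoted collision can be illustrated with a minimal sketch: an ID derived
from a timestamp with one-second resolution (the scheme described above; the
class and method names here are hypothetical, not Spark's actual code).

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class JobIdCollision {
    // Hypothetical illustration: a job ID with one-second resolution,
    // mimicking the timestamp-based scheme the quote describes.
    static String jobId(LocalDateTime now) {
        return now.format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss"));
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.of(2023, 7, 6, 12, 0, 0);
        // Two jobs launched 300 ms apart land in the same second...
        String a = jobId(t);
        String b = jobId(t.plusNanos(300_000_000L));
        // ...and therefore end up with identical job IDs.
        System.out.println(a.equals(b)); // prints "true"
    }
}
```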
That said, I now think it would be safe to delete it under one condition:
fs.s3a.committer.require.uuid is enabled, in which case there is effectively no
risk of a ${UUID} collision.
> S3A StagingCommitter does not clean up staging-uploads directory
> ----------------------------------------------------------------
>
> Key: HADOOP-18793
> URL: https://issues.apache.org/jira/browse/HADOOP-18793
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 3.2.2
> Reporter: Harunobu Daikoku
> Priority: Minor
>
> When setting up the StagingCommitter and its internal FileOutputCommitter, a
> temporary directory holding MPU (multipart upload) information is created on
> the default FS, which by default is
> /user/${USER}/tmp/staging/${USER}/${UUID}/staging-uploads.
> On a successful job commit, its child directory (_temporary) is [cleaned
> up|https://github.com/apache/hadoop/blob/a36d8adfd18e88f2752f4387ac4497aadd3a74e7/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/staging/StagingCommitter.java#L516]
> properly, but ${UUID}/staging-uploads remains.
> This results in an accumulation of empty ${UUID}/staging-uploads directories
> under /user/${USER}/tmp/staging/${USER}, and eventually causes failures in
> environments where the maximum number of items in a directory is capped
> (e.g. by dfs.namenode.fs-limits.max-directory-items in HDFS).
> {noformat}
> The directory item limit of /user/${USER}/tmp/staging/${USER} is exceeded:
> limit=1048576 items=1048576
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:1205)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)