[
https://issues.apache.org/jira/browse/HADOOP-17217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kaya Kupferschmidt updated HADOOP-17217:
----------------------------------------
Summary: S3A FileSystem does not correctly delete directories with fake entries  (was: S3A FileSystem gets confused by fake directory entries)
> S3A FileSystem does not correctly delete directories with fake entries
> ----------------------------------------------------------------------
>
> Key: HADOOP-17217
> URL: https://issues.apache.org/jira/browse/HADOOP-17217
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 3.2.0
> Reporter: Kaya Kupferschmidt
> Priority: Major
>
> h2. Summary
> We are facing an issue where the Hadoop S3A FileSystem gets confused by fake
> directory objects in S3. Specifically, recursively removing a whole Hadoop
> directory (i.e. all objects sharing the same S3 prefix) does not work
> correctly: only the fake directory entry is removed, while the other objects
> below that prefix remain.
> h2. Background
> We are using Alluxio together with S3 as our deep store. For infrastructure
> reasons we decided to write directly to S3 (bypassing Alluxio) from our Spark
> applications, and to use Alluxio only for reading data.
> When we write our results directly into S3 with Spark, everything is fine.
> But once Alluxio accesses S3, it creates fake directory entries (zero-byte
> objects whose key ends with a slash).
> When we then try to overwrite existing data in S3 with a Spark application,
> the result is incorrect: Spark only removes the fake directory entry, but not
> the other objects below that prefix in S3.
> Of course it is questionable whether Alluxio is doing the right thing, but on
> the other hand Hadoop also does not behave as we expected.
> h2. Steps to Reproduce
> The following steps require only the AWS CLI and the Hadoop CLI to reproduce
> the issue we are facing:
> h3. Initial setup
> {code:bash}
> # First step: Create a new, empty bucket in S3
> $ aws s3 mb s3://dimajix-tmp
> make_bucket: dimajix-tmp
> $ aws s3 ls
> 2020-08-21 11:19:50 dimajix-tmp
> # Upload some data
> $ aws s3 cp some_file.txt s3://dimajix-tmp/tmp/
> upload: ./some_file.txt to s3://dimajix-tmp/tmp/some_file.txt
> $ aws s3 ls s3://dimajix-tmp/tmp/
> 2020-08-21 11:23:35 0 some_file.txt
> # Check that Hadoop can list the file
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/
> Found 1 items
> drwxrwxrwx - kaya kaya 0 2020-08-21 11:24 s3a://dimajix-tmp/tmp
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
> -rw-rw-rw- 1 kaya kaya 0 2020-08-21 11:23 s3a://dimajix-tmp/tmp/some_file.txt
> # Evil step: Create fake directory entry in S3
> $ aws s3api put-object --bucket dimajix-tmp --key tmp/
> {
> "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\""
> }
> # Look into S3, ensure that fake directory entry was created
> $ aws s3 ls s3://dimajix-tmp/tmp/
> 2020-08-21 11:25:40 0
> 2020-08-21 11:23:35 0 some_file.txt
> # Look into S3 using Hadoop CLI, ensure that everything looks okay
> # (which is the case)
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
> Found 1 items
> -rw-rw-rw- 1 kaya kaya 0 2020-08-21 11:23 s3a://dimajix-tmp/tmp/some_file.txt
> {code}
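> At this point the fake directory entry and the real file exist as two
> separate objects under the same prefix. Listing the raw keys makes this
> explicit; this is only a sketch using the AWS CLI, and it should show the
> zero-byte marker key tmp/ next to tmp/some_file.txt:
> {code:bash}
> # Sketch: list the raw object keys and sizes under the prefix. This should
> # show two entries: the zero-byte fake directory marker "tmp/" and the real
> # file "tmp/some_file.txt".
> $ aws s3api list-objects-v2 --bucket dimajix-tmp --prefix tmp/ \
>       --query 'Contents[].[Key,Size]' --output text
> {code}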
> h3. Reproduce questionable behaviour: Try to recursively delete directory
> {code:bash}
> # Bug: Now try to delete the directory with Hadoop CLI
> $ /opt/hadoop/bin/hdfs dfs -rm s3a://dimajix-tmp/tmp/
> rm: `s3a://dimajix-tmp/tmp': Is a directory
> # Okay, that didn't work out: Hadoop interprets the prefix as a directory
> # (which is fine). It also did not delete anything, as we can see in S3:
> $ aws s3 ls s3://dimajix-tmp/tmp/
> 2020-08-21 11:25:40 0
> 2020-08-21 11:23:35 0 some_file.txt
> # Now let's try a little harder and recursively delete the directory
> $ /opt/hadoop/bin/hdfs dfs -rm -r s3a://dimajix-tmp/tmp/
> Deleted s3a://dimajix-tmp/tmp
> # Everything looked fine so far. But let's inspect S3 directly. We'll find
> # that only the prefix (fake directory entry) has been removed. The file in
> # the directory is still there.
> $ aws s3 ls s3://dimajix-tmp/tmp/
> 2020-08-21 11:23:35 0 some_file.txt
> # We can also use Hadoop CLI to check that the directory containing the file
> # is still present, although we wanted to delete it above.
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
> Found 1 items
> -rw-rw-rw- 1 kaya kaya 0 2020-08-21 11:23 s3a://dimajix-tmp/tmp/some_file.txt
> {code}
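> Listing the raw keys again at this point (same sketch command as above)
> confirms what the AWS CLI output shows: only the zero-byte marker key tmp/ is
> gone, while tmp/some_file.txt is still present.
> {code:bash}
> # Sketch: after the first "-rm -r" only the fake directory marker "tmp/" has
> # disappeared; the real file "tmp/some_file.txt" remains under the prefix.
> $ aws s3api list-objects-v2 --bucket dimajix-tmp --prefix tmp/ \
>       --query 'Contents[].[Key,Size]' --output text
> {code}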
> h3. Remedy by performing a second delete
> {code:bash}
> # Now let's perform the same action again to remove the directory and all its
> # contents
> $ /opt/hadoop/bin/hdfs dfs -rm -r s3a://dimajix-tmp/tmp/
> Deleted s3a://dimajix-tmp/tmp
> # Finally everything was cleaned up.
> $ aws s3 ls s3://dimajix-tmp/tmp/
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
> ls: `s3a://dimajix-tmp/tmp/': No such file or directory
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/
> {code}
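> As a workaround on our side, clearing the prefix directly with the AWS CLI
> (instead of relying on a second Hadoop delete) removes both the fake
> directory entry and the objects below it in one go; a minimal sketch:
> {code:bash}
> # Workaround sketch: delete every object under the prefix, including the
> # zero-byte fake directory marker.
> $ aws s3 rm s3://dimajix-tmp/tmp/ --recursive
> {code}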
> h2. Actual Behaviour vs Expected Behaviour
> When trying to recursively remove a directory using the Hadoop CLI, I expect
> the S3 prefix and all objects under that prefix to be removed from S3. But in
> the presence of a fake directory entry, only the prefix object (the fake
> directory entry) itself is removed; the objects below it survive and a second
> delete is needed.
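> In other words, after a single recursive delete via the Hadoop CLI I would
> expect the prefix to be completely empty; a sketch of the check I would
> expect to pass:
> {code:bash}
> # Expected behaviour: one recursive delete via the Hadoop CLI ...
> $ /opt/hadoop/bin/hdfs dfs -rm -r s3a://dimajix-tmp/tmp/
> # ... should leave no objects under the prefix, so this should print nothing:
> $ aws s3 ls s3://dimajix-tmp/tmp/
> {code}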