[
https://issues.apache.org/jira/browse/HADOOP-17217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kaya Kupferschmidt updated HADOOP-17217:
----------------------------------------
Description:
h3. Summary
We are facing an issue where the Hadoop S3A FileSystem gets confused by fake
directory objects in S3. Specifically, trying to recursively remove a whole
Hadoop directory (i.e. all objects sharing the same S3 prefix) does not work as
expected.
h2. Background
We are using Alluxio together with S3 as our deep store. For infrastructure
reasons we decided to write directly to S3 (bypassing Alluxio) from our Spark
applications and to use Alluxio only for reading data.
When we write our results directly into S3 with Spark, everything is fine. But
once Alluxio accesses S3, it creates fake directory entries (zero-byte objects
whose key ends with a slash). When we then try to overwrite existing data in S3
with a Spark application, the result is incorrect: Spark only removes the fake
directory entry, but not the other objects below that prefix in S3.
Of course it is questionable whether Alluxio is doing the right thing, but on
the other hand Hadoop also does not behave as we expected.
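To make the problem easier to spot, here is a small sketch (not part of the original reproduction; the bucket name and prefix are placeholders) that lists the zero-byte "directory marker" objects under a prefix, i.e. keys that end in a slash and have size 0:
{code:bash}
# Sketch: list zero-byte objects whose key ends in "/" under a given prefix.
# "my-bucket" and "data/" are placeholders, not taken from the report below.
aws s3api list-objects-v2 --bucket my-bucket --prefix data/ \
  --query 'Contents[].[Size,Key]' --output text \
  | awk '$1 == 0 && $2 ~ /\/$/ {print $2}'
{code}
Any key printed by this command is the kind of fake directory entry discussed here.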
h2. Steps to Reproduce
The following steps require only the AWS CLI and the Hadoop CLI to reproduce
the issue we are facing:
h3. Initial setup
{code:bash}
# First step: Create a new, empty bucket in S3
$ aws s3 mb s3://dimajix-tmp
make_bucket: dimajix-tmp
$ aws s3 ls
2020-08-21 11:19:50 dimajix-tmp
# Upload some data
$ aws s3 cp some_file.txt s3://dimajix-tmp/tmp/
upload: ./some_file.txt to s3://dimajix-tmp/tmp/some_file.txt
$ aws s3 ls s3://dimajix-tmp/tmp/
2020-08-21 11:23:35 0 some_file.txt
# Check that Hadoop can list the file
$ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/
Found 1 items
drwxrwxrwx - kaya kaya 0 2020-08-21 11:24 s3a://dimajix-tmp/tmp
$ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
-rw-rw-rw- 1 kaya kaya 0 2020-08-21 11:23 s3a://dimajix-tmp/tmp/some_file.txt
# Evil step: Create fake directory entry in S3
$ aws s3api put-object --bucket dimajix-tmp --key tmp/
{
"ETag": "\"d41d8cd98f00b204e9800998ecf8427e\""
}
# Look into S3 and ensure that the fake directory entry was created
$ aws s3 ls s3://dimajix-tmp/tmp/
2020-08-21 11:25:40 0
2020-08-21 11:23:35 0 some_file.txt
# Look into S3 using the Hadoop CLI and ensure that everything looks okay
# (which is the case)
$ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
Found 1 items
-rw-rw-rw- 1 kaya kaya 0 2020-08-21 11:23 s3a://dimajix-tmp/tmp/some_file.txt
{code}
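As an additional sanity check (not part of the original steps), the marker object can be inspected directly; it should show up as a plain zero-byte object, which is exactly the kind of fake directory entry described above:
{code:bash}
# Optional check (sketch): "tmp/" is an ordinary S3 object with ContentLength 0.
aws s3api head-object --bucket dimajix-tmp --key tmp/
{code}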
h3. Reproduce questionable behaviour: Try to recursively delete the directory
{code:bash}
# Bug: Now try to delete the directory with the Hadoop CLI
$ /opt/hadoop/bin/hdfs dfs -rm s3a://dimajix-tmp/tmp/
rm: `s3a://dimajix-tmp/tmp': Is a directory
# Okay, that didn't work out: Hadoop interprets the prefix as a directory
# (which is fine). It also did not delete anything, as we can see in S3:
$ aws s3 ls s3://dimajix-tmp/tmp/
2020-08-21 11:25:40 0
2020-08-21 11:23:35 0 some_file.txt
# Now let's try a little harder and recursively delete the directory
$ /opt/hadoop/bin/hdfs dfs -rm -r s3a://dimajix-tmp/tmp/
Deleted s3a://dimajix-tmp/tmp
# Everything looked fine so far. But let's inspect S3 directly: we'll find that
# only the prefix (the fake directory entry) has been removed. The file in the
# directory is still there.
$ aws s3 ls s3://dimajix-tmp/tmp/
2020-08-21 11:23:35 0 some_file.txt
# We can also use the Hadoop CLI to check that the directory containing the
# file is still present, although we wanted to delete it above.
$ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
Found 1 items
-rw-rw-rw- 1 kaya kaya 0 2020-08-21 11:23 s3a://dimajix-tmp/tmp/some_file.txt
{code}
h3. Remedy by performing a second delete
{code:bash}
# Now let's perform the same action again to remove the directory and all its
# contents
$ /opt/hadoop/bin/hdfs dfs -rm -r s3a://dimajix-tmp/tmp/
Deleted s3a://dimajix-tmp/tmp
# Finally everything was cleaned up.
$ aws s3 ls s3://dimajix-tmp/tmp/
$ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
ls: `s3a://dimajix-tmp/tmp/': No such file or directory
$ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/
{code}
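Until this is fixed, a possible workaround (just a sketch, not something we have verified in production) is to repeat the recursive delete until the path is really gone, since each pass may only remove the fake directory marker:
{code:bash}
# Workaround sketch: keep deleting while the path is still visible as a
# directory. "hdfs dfs -test -d" returns 0 as long as the directory (or the
# fake directory marker) still exists.
while /opt/hadoop/bin/hdfs dfs -test -d s3a://dimajix-tmp/tmp/; do
  /opt/hadoop/bin/hdfs dfs -rm -r s3a://dimajix-tmp/tmp/
done
{code}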
h2. Actual Behaviour vs Expected Behaviour
When recursively removing a directory with the Hadoop CLI, I expect the S3
prefix and all objects under that prefix to be removed from S3. In practice,
only the prefix (the fake directory entry) itself is removed.
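For completeness, here is a minimal check of the expected behaviour (a sketch reusing the bucket from the steps above): after a recursive delete, no objects at all should remain under the prefix.
{code:bash}
# Expected-behaviour check (sketch): after "hdfs dfs -rm -r", S3 should report
# zero remaining keys under the prefix.
/opt/hadoop/bin/hdfs dfs -rm -r s3a://dimajix-tmp/tmp/
remaining=$(aws s3api list-objects-v2 --bucket dimajix-tmp --prefix tmp/ \
  --query 'KeyCount' --output text)
if [ "$remaining" -ne 0 ]; then
  echo "BUG: $remaining object(s) still under tmp/ after the recursive delete"
fi
{code}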
> S3A FileSystem does not correctly delete directories with fake entries
> ----------------------------------------------------------------------
>
> Key: HADOOP-17217
> URL: https://issues.apache.org/jira/browse/HADOOP-17217
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 3.2.0
> Reporter: Kaya Kupferschmidt
> Priority: Major
>