[ https://issues.apache.org/jira/browse/HADOOP-17217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaya Kupferschmidt updated HADOOP-17217:
----------------------------------------
    Summary: S3A FileSystem does not correctly delete directories with fake entries  (was: S3A FileSystem gets confused by fake directory entries)

> S3A FileSystem does not correctly delete directories with fake entries
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-17217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Kaya Kupferschmidt
>            Priority: Major
>
> h3. Summary
> We are facing an issue where the Hadoop S3A FileSystem gets confused by fake 
> directory objects in S3. Specifically, recursively removing a whole Hadoop 
> directory (i.e. all objects sharing the same S3 prefix) does not work 
> correctly.
> h2. Background
> We are using Alluxio together with S3 as our deep store. For infrastructure 
> reasons we decided to write directly to S3 (bypassing Alluxio) from our Spark 
> applications, and to use Alluxio only for reading data.
> When we write our results directly into S3 with Spark, everything is fine. 
> But once Alluxio accesses S3, it creates these fake directory entries. When 
> we then try to overwrite existing data in S3 with a Spark application, the 
> result is incorrect: Spark only removes the fake directory entry, but not 
> the other objects below that prefix in S3.
> Of course it is questionable whether Alluxio is doing the right thing, but on 
> the other hand Hadoop also does not behave as we expected.
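> For reference, such a fake directory entry is just a zero-byte object whose 
> key ends in a slash. As a sketch (using the bucket and prefix created in the 
> reproduction below), a raw S3 listing shows the marker right next to the 
> real object:
> {code:bash}
> # List raw objects under the prefix; the entry whose key ends in "/" is the
> # fake directory marker, the other entry is the actual file.
> $ aws s3api list-objects-v2 --bucket dimajix-tmp --prefix tmp/ \
>     --query 'Contents[].[Key,Size]' --output text
> tmp/    0
> tmp/some_file.txt    0
> {code}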
> h2. Steps to Reproduce
> The following steps only require AWS CLI and Hadoop CLI to reproduce the 
> issue we are facing:
> h3. Initial setup
> {code:bash}
> # First step: Create an new and empty bucket in S3
> $ aws s3 mb s3://dimajix-tmp
> make_bucket: dimajix-tmp
> $ aws s3 ls
> 2020-08-21 11:19:50 dimajix-tmp
> # Upload some data
> $ aws s3 cp some_file.txt s3://dimajix-tmp/tmp/
> upload: ./some_file.txt to s3://dimajix-tmp/tmp/some_file.txt
> $ aws s3 ls s3://dimajix-tmp/tmp/
> 2020-08-21 11:23:35          0 some_file.txt
> # Check that Hadoop can list the file
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/
> Found 1 items
> drwxrwxrwx   - kaya kaya          0 2020-08-21 11:24 s3a://dimajix-tmp/tmp
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
> -rw-rw-rw-   1 kaya kaya          0 2020-08-21 11:23 s3a://dimajix-tmp/tmp/some_file.txt
> # Evil step: Create fake directory entry in S3
> $ aws s3api put-object --bucket dimajix-tmp --key tmp/
> {
>     "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\""
> }
> # Look into S3, ensure that fake directory entry was created
> $ aws s3 ls s3://dimajix-tmp/tmp/
> 2020-08-21 11:25:40          0
> 2020-08-21 11:23:35          0 some_file.txt
> # Look into S3 using Hadoop CLI, ensure that everything looks okay (which is
> # the case)
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
> Found 1 items
> -rw-rw-rw-   1 kaya kaya          0 2020-08-21 11:23 s3a://dimajix-tmp/tmp/some_file.txt
> {code}
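> To double-check that the marker really is just a plain zero-byte object, it 
> can be inspected directly (a sketch; the response is abridged here):
> {code:bash}
> # Note the zero content length and the ETag of an empty body, matching the
> # put-object response above.
> $ aws s3api head-object --bucket dimajix-tmp --key tmp/
> {
>     "ContentLength": 0,
>     "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
>     ...
> }
> {code}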
> h3. Reproduce questionable behaviour: Try to recursively delete directory
> {code:bash}
> # Bug: Now try to delete the directory with Hadoop CLI
> $ /opt/hadoop/bin/hdfs dfs -rm s3a://dimajix-tmp/tmp/
> rm: `s3a://dimajix-tmp/tmp': Is a directory
> # Okay, that didn't work out: Hadoop interprets the prefix as a directory
> # (which is fine). It also did not delete anything, as we can see in S3:
> $ aws s3 ls s3://dimajix-tmp/tmp/
> 2020-08-21 11:25:40          0
> 2020-08-21 11:23:35          0 some_file.txt
> # Now let's try a little harder and recursively delete the directory
> $ /opt/hadoop/bin/hdfs dfs -rm -r s3a://dimajix-tmp/tmp/
> Deleted s3a://dimajix-tmp/tmp
> # Everything looked fine so far. But let's inspect S3 directly. We'll find
> # that only the prefix (fake directory entry) has been removed. The file in
> # the directory is still there.
> $ aws s3 ls s3://dimajix-tmp/tmp/
> 2020-08-21 11:23:35          0 some_file.txt
> # We can also use Hadoop CLI to check that the directory containing the file
> # is still present, although we wanted to delete it above.
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
> Found 1 items
> -rw-rw-rw-   1 kaya kaya          0 2020-08-21 11:23 s3a://dimajix-tmp/tmp/some_file.txt
> {code}
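> At the S3 level, a correct recursive delete has to enumerate every key below 
> the prefix and delete all of them, not just the marker. A minimal sketch of 
> that with the AWS CLI (assuming jq is available; bucket and prefix as above):
> {code:bash}
> # Collect all keys under the prefix and issue a single bulk delete for them.
> $ aws s3api list-objects-v2 --bucket dimajix-tmp --prefix tmp/ \
>     --query 'Contents[].{Key: Key}' --output json \
>     | jq '{Objects: ., Quiet: true}' > /tmp/delete.json
> $ aws s3api delete-objects --bucket dimajix-tmp --delete file:///tmp/delete.json
> {code}
> This one-pass behaviour is what we expected from the S3A recursive delete.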
> h3. Remedy by performing second delete
> {code:bash}
> # Now let's perform the same action again to remove the directory and all
> # its contents
> $ /opt/hadoop/bin/hdfs dfs -rm -r s3a://dimajix-tmp/tmp/
> Deleted s3a://dimajix-tmp/tmp
> # Finally everything was cleaned up.
> $ aws s3 ls s3://dimajix-tmp/tmp/
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/tmp/
> ls: `s3a://dimajix-tmp/tmp/': No such file or directory
> $ /opt/hadoop/bin/hdfs dfs -ls s3a://dimajix-tmp/
> {code}
> h2. Actual Behaviour vs Expected Behaviour
> When trying to recursively remove a directory using the Hadoop CLI, I expect 
> the S3 prefix and all objects under that prefix to be removed from S3. But in 
> fact only the prefix (the fake directory entry) itself is removed; a second 
> invocation is required to delete the remaining objects.
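> As a workaround outside of Hadoop, a plain AWS CLI recursive delete removes 
> the fake directory entry and all objects below the prefix in one pass (a 
> sketch using the bucket from the reproduction above):
> {code:bash}
> # Delete the marker and every object under the prefix in a single command.
> $ aws s3 rm s3://dimajix-tmp/tmp/ --recursive
> delete: s3://dimajix-tmp/tmp/
> delete: s3://dimajix-tmp/tmp/some_file.txt
> {code}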



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
