[jira] [Created] (IMPALA-9293) Impala Doc: Revise explanation of HDFS trashcan usage on S3

Sahil Takiar (Jira) Mon, 13 Jan 2020 09:33:20 -0800

Sahil Takiar created IMPALA-9293:
------------------------------------

             Summary: Impala Doc: Revise explanation of HDFS trashcan usage on S3
                 Key: IMPALA-9293
                 URL: https://issues.apache.org/jira/browse/IMPALA-9293
             Project: IMPALA
          Issue Type: Task
          Components: Docs
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar



The Impala docs state:

{quote}
By default, when you drop an internal (managed) table, the data files are moved 
to the HDFS trashcan. This operation is expensive for tables that reside on the 
Amazon S3 filesystem. Therefore, for S3 tables, prefer to use DROP TABLE 
table_name PURGE rather than the default DROP TABLE statement. The PURGE clause 
makes Impala delete the data files immediately, skipping the HDFS trashcan.
{quote}

and

{quote}
The default DROP TABLE/PARTITION is slow because Impala copies the files to the 
HDFS trash folder, and Impala waits until all the data is moved. DROP 
TABLE/PARTITION .. PURGE is a fast delete operation, and the Impala statement 
finishes quickly even though the change might not have propagated fully 
throughout S3.
{quote}

The confusing part is "Impala copies the files to the HDFS trash folder". Users 
might think that when a managed Impala table on S3 is dropped, Impala actually 
copies the data from S3 to a trashcan folder *stored on HDFS*. This isn't true. 
The term "HDFS trashcan" is used to refer to a feature of HDFS where all 
deleted data is moved to a trash folder rather than being deleted immediately. 
See https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#File+Deletes+and+Undeletes
for details.

What actually happens is that there is a trashcan folder on S3 itself, and when 
an S3 managed table is dropped, the data is copied from the managed table 
folder to the trashcan folder *stored on S3*.
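
To make the distinction concrete, here is a minimal sketch of where the dropped data would end up. It assumes Hadoop's default trash layout, /user/<user>/.Trash/Current/<original path>, rooted on the same filesystem as the table. The function name trash_path, the bucket name, and the user name are hypothetical and chosen only for illustration. This is not Impala's actual implementation.

```python
from urllib.parse import urlsplit


def trash_path(table_location: str, user: str) -> str:
    """Return the assumed trash destination for a table directory.

    Assumption: the trash folder follows Hadoop's default layout,
    /user/<user>/.Trash/Current/<original path>, on the SAME filesystem
    (same scheme and same bucket or namenode) as the table itself.
    """
    parts = urlsplit(table_location)
    # parts.path starts with "/", so plain concatenation keeps the
    # original directory structure beneath the trash root.
    dest = f"/user/{user}/.Trash/Current{parts.path}"
    return f"{parts.scheme}://{parts.netloc}{dest}"


# Hypothetical S3 table: the trash destination stays inside the same bucket.
print(trash_path("s3a://my-bucket/warehouse/sales", "impala"))

# Hypothetical HDFS table: the trash destination stays on the same namenode.
print(trash_path("hdfs://nn:8020/warehouse/sales", "impala"))
```

In both cases the scheme and authority of the result match those of the input, which is the point the docs should convey: dropping an S3 table never sends data to an HDFS cluster.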



