Sahil Takiar created IMPALA-9293:
------------------------------------
Summary: Impala Doc: Revise explanation of HDFS trashcan usage on S3
Key: IMPALA-9293
URL: https://issues.apache.org/jira/browse/IMPALA-9293
Project: IMPALA
Issue Type: Task
Components: Docs
Reporter: Sahil Takiar
Assignee: Sahil Takiar
The Impala docs state:
{quote}
By default, when you drop an internal (managed) table, the data files are moved
to the HDFS trashcan. This operation is expensive for tables that reside on the
Amazon S3 filesystem. Therefore, for S3 tables, prefer to use DROP TABLE
table_name PURGE rather than the default DROP TABLE statement. The PURGE clause
makes Impala delete the data files immediately, skipping the HDFS trashcan.
{quote}
and
{quote}
The default DROP TABLE/PARTITION is slow because Impala copies the files to the
HDFS trash folder, and Impala waits until all the data is moved. DROP
TABLE/PARTITION .. PURGE is a fast delete operation, and the Impala statement
finishes quickly even though the change might not have propagated fully
throughout S3.
{quote}
The confusing part is "Impala copies the files to the HDFS trash folder". Users
might think that when a managed Impala table on S3 is dropped, Impala actually
copies the data from S3 to a trashcan folder *stored on HDFS*. This isn't true.
The term "HDFS trashcan" is used to refer to a feature of HDFS where all
deleted data is moved to a trash folder rather than being deleted immediately.
See
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#File+Deletes+and+Undeletes
for details.
What actually happens is that there is a trashcan folder on S3 itself, and when
an S3 managed table is dropped, the data is copied from the managed table
folder to the trashcan folder *stored on S3*.
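To make the distinction concrete, here is a minimal Python sketch of how the
trash destination relates to the original file location. It is a simplified
model, not Impala or Hadoop code: the function name trash_destination, the
bucket name, and the user name are hypothetical, and the sketch assumes the
usual Hadoop trash layout of /user/<user>/.Trash/Current under the same
filesystem as the deleted file.

```python
from urllib.parse import urlsplit


def trash_destination(file_uri: str, user: str) -> str:
    """Return where a trash-enabled delete would place file_uri.

    Simplified model: the trash folder lives on the SAME filesystem
    (same scheme and authority) as the file being deleted.
    """
    parts = urlsplit(file_uri)
    # Assumption: trash root is /user/<user>/.Trash/Current, with the
    # original absolute path appended underneath it.
    trash_path = f"/user/{user}/.Trash/Current{parts.path}"
    return f"{parts.scheme}://{parts.netloc}{trash_path}"


if __name__ == "__main__":
    # Hypothetical S3 table file: the trash copy stays in the same bucket.
    print(trash_destination("s3a://my-bucket/warehouse/t1/part-0.parq", "impala"))
    # Hypothetical HDFS table file: the trash copy stays on the same cluster.
    print(trash_destination("hdfs://nn:8020/warehouse/t1/part-0.parq", "impala"))
```

In this model an S3 path never maps to an hdfs:// location, which is the point
the docs should make explicit. Because S3 has no native rename, moving files
into that trash folder amounts to a copy followed by a delete, and DROP TABLE
... PURGE skips that step.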