[ 
https://issues.apache.org/jira/browse/IMPALA-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014673#comment-17014673
 ] 

ASF subversion and git services commented on IMPALA-9293:
---------------------------------------------------------

Commit 64828f8b765ad9a8f2cc0f4aede9a1d4def235d7 in impala's branch 
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=64828f8 ]

IMPALA-9293: [DOCS] Impala Doc: Revise explanation of HDFS trashcan usage on S3

Updated impala_s3.xml to refer to the trashcan as the "S3A trashcan"
rather than the "HDFS trashcan".

Change-Id: If321117b0d58e3f6d79251fad97b8bd92882cc12
Reviewed-on: http://gerrit.cloudera.org:8080/15022
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Joe McDonnell <[email protected]>


> Impala Doc: Revise explanation of HDFS trashcan usage on S3
> -----------------------------------------------------------
>
>                 Key: IMPALA-9293
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9293
>             Project: IMPALA
>          Issue Type: Task
>          Components: Docs
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> The Impala docs state:
> {quote}
> By default, when you drop an internal (managed) table, the data files are 
> moved to the HDFS trashcan. This operation is expensive for tables that 
> reside on the Amazon S3 filesystem. Therefore, for S3 tables, prefer to use 
> DROP TABLE table_name PURGE rather than the default DROP TABLE statement. The 
> PURGE clause makes Impala delete the data files immediately, skipping the 
> HDFS trashcan.
> {quote}
> and
> {quote}
> The default DROP TABLE/PARTITION is slow because Impala copies the files to 
> the HDFS trash folder, and Impala waits until all the data is moved. DROP 
> TABLE/PARTITION .. PURGE is a fast delete operation, and the Impala statement 
> finishes quickly even though the change might not have propagated fully 
> throughout S3.
> {quote}
> The confusing part is "Impala copies the files to the HDFS trash folder". 
> Users might think that when a managed Impala table on S3 is dropped, Impala 
> actually copies the data from S3 to a trashcan folder *stored on HDFS*. This 
> isn't true. The term "HDFS trashcan" is used to refer to a feature of HDFS 
> where all deleted data is moved to a trash folder rather than being deleted 
> immediately. See 
> https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#File+Deletes+and+Undeletes
>  for details.
> What actually happens is that there is a trashcan folder on S3 itself, and 
> when an S3 managed table is dropped, the data is copied from the managed 
> table folder to the trashcan folder *stored on S3*.
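
To illustrate the point in the description above, here is a minimal sketch of the two drop paths. It is a toy in-memory model, not Impala or S3A code: the `drop_table` helper, the key layout, and the `.Trash/Current/` prefix are all assumptions made up for this example. It only shows that the default drop copies each object to a trash location inside the same store, while a purge deletes without copying.

```python
# Toy model of an object store as a dict of key -> bytes.
# Hypothetical names throughout; this is not how Impala or S3A is implemented.

def drop_table(store, table_prefix, trash_prefix, purge=False):
    """Remove every object under table_prefix.

    Without purge, each object is first copied under trash_prefix in the
    SAME store (nothing leaves the store). Returns the number of copies made.
    """
    keys = [k for k in store if k.startswith(table_prefix)]
    copied = 0
    for key in keys:
        if not purge:
            # An object store has no cheap rename, so "move to trash"
            # is modeled as copy-then-delete, one object at a time.
            store[trash_prefix + key] = store[key]
            copied += 1
        del store[key]
    return copied


if __name__ == "__main__":
    store = {
        "warehouse/t1/part-0": b"aaaa",
        "warehouse/t1/part-1": b"bbbb",
    }
    copies = drop_table(store, "warehouse/t1/", ".Trash/Current/")
    print(copies, sorted(store))
```

In this model the cost of the default drop grows with the number of objects copied, which matches the docs' advice to prefer PURGE for S3 tables.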



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
