[ 
https://issues.apache.org/jira/browse/HIVE-16266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584850#comment-16584850
 ] 

Sushanth Sowmyan commented on HIVE-16266:
-----------------------------------------

Hi [~akolb], apologies if this reply is no longer accurate ([~anishek] or 
[~sankarh] might be able to clarify if things have changed - I have not been 
active with Hive for a year now), but at the time the repl subsystem was 
written, that's correct, and intentional.

The basic idea is this - Hive has two types of tables: MANAGED, where Hive is 
responsible for the storage, and EXTERNAL, where some other external program is 
responsible for the storage. A key way to think about this distinction is what 
happens when you do a DROP TABLE. For MANAGED tables, if a DROP TABLE is 
issued, Hive should delete the data on HDFS, since we own and manage the data 
as well. For EXTERNAL tables, we are guests, and some other tool is managing 
the data, and thus, we should not touch it - we can drop the metadata, but we 
leave the data on HDFS alone.
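
To make the DROP TABLE distinction concrete, here is a minimal toy model in 
Python. It is only a sketch, not Hive code: ToyWarehouse, its two dicts (one 
standing in for the metastore, one for HDFS) and the table names and paths are 
all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    table_type: str  # "MANAGED" or "EXTERNAL"
    location: str


class ToyWarehouse:
    """Hypothetical stand-in for a metastore plus a filesystem."""

    def __init__(self):
        self.metastore = {}  # table name -> Table (metadata only)
        self.files = {}      # location -> list of rows (stand-in for HDFS)

    def create_table(self, name, table_type, location, rows):
        self.metastore[name] = Table(name, table_type, location)
        self.files[location] = list(rows)

    def drop_table(self, name):
        # The metadata is always dropped.
        table = self.metastore.pop(name)
        # The data is deleted only when the warehouse owns it.
        if table.table_type == "MANAGED":
            self.files.pop(table.location, None)


wh = ToyWarehouse()
wh.create_table("sales_managed", "MANAGED", "/warehouse/sales_managed", ["r1"])
wh.create_table("sales_external", "EXTERNAL", "/data/sales_external", ["r2"])

wh.drop_table("sales_managed")
wh.drop_table("sales_external")
```

After both drops the toy metastore is empty, the managed location is gone, and 
the external location still holds its data.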

Now, in the case where we're replicating from a primary to a secondary, if the 
table is an EXTERNAL table on the primary, then an external tool is managing it 
on the primary. But what about the secondary? Since the secondary is being 
"managed" by Hive Replication, and thus by Hive, we own and manage it, keeping it 
in sync with the primary. Thus, by definition, the copy is MANAGED even if the 
source is EXTERNAL. If we kept it EXTERNAL, we would start having some weird 
midway behaviour that we'd have to add complex rules for - consider the same 
deletion scenario:

If we have a DROP PARTITION on the source table, by definition, on the source, 
we do not delete the data on the source HDFS. The user will likely do an hdfs 
rm, refresh the data and might do an ADD PARTITION of new data. Now, what about 
the destination? Should we delete the data corresponding to that DROP PARTITION 
on the destination? If so, then it is consistent with behaviour for MANAGED, 
rather than EXTERNAL, and thus, we should keep it as MANAGED. If not, then 
well, we have leftover data sitting in HDFS in the same location, and if new 
data gets added in, as a result of an upcoming ADD PARTITION, then the result 
is indeterminate and depends on what the user did - it can be the correct new 
data, it can be a partial merge or a weird append. That gets messy fast.
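
The DROP PARTITION / ADD PARTITION scenario can be sketched with a small 
Python simulation. This is a toy model under stated assumptions, not how the 
repl subsystem is implemented: the function names, the delete_data flag, the 
partition name and the file names are all hypothetical.

```python
def drop_partition(metadata, files, partition, delete_data):
    """Drop the partition metadata; delete its files only if delete_data."""
    location = metadata.pop(partition)
    if delete_data:
        files.pop(location, None)


def add_partition(metadata, files, partition, location, new_files):
    """Register the partition and place the new files in its location."""
    metadata[partition] = location
    files.setdefault(location, []).extend(new_files)


def replay_on_destination(delete_data):
    """Replay DROP PARTITION then ADD PARTITION on a toy destination."""
    part = "dt=2018-08-01"          # hypothetical partition
    loc = "/repl/t/dt=2018-08-01"   # hypothetical destination path
    metadata = {part: loc}
    files = {loc: ["old_0.orc"]}

    drop_partition(metadata, files, part, delete_data)
    add_partition(metadata, files, part, loc, ["new_0.orc"])
    return files[loc]


# MANAGED-style destination: the drop removes the old data first.
managed_result = replay_on_destination(delete_data=True)

# EXTERNAL-style destination: the old file is left behind and the
# new data lands next to it in the same location.
external_result = replay_on_destination(delete_data=False)
```

In this sketch the MANAGED-style replay ends with only the new file, while the 
EXTERNAL-style replay ends with the stale file and the new file side by side, 
which is the "weird append" case.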

So, for this problem and other possible unexpected problems, we decided to be 
consistent with the meaning of MANAGED and EXTERNAL, and always make repl 
destinations MANAGED. :) 

 

> Enable function metadata to be written during bootstrap
> -------------------------------------------------------
>
>                 Key: HIVE-16266
>                 URL: https://issues.apache.org/jira/browse/HIVE-16266
>             Project: Hive
>          Issue Type: Sub-task
>          Components: repl
>    Affects Versions: 2.2.0
>            Reporter: anishek
>            Assignee: anishek
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: HIVE-16266.1.patch, HIVE-16266.2.patch, 
> HIVE-16266.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
