[
https://issues.apache.org/jira/browse/HIVE-16266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584850#comment-16584850
]
Sushanth Sowmyan commented on HIVE-16266:
-----------------------------------------
Hi [~akolb], apologies if this reply is no longer accurate ([~anishek] or
[~sankarh] might be able to clarify whether things have changed - I have not
been active with Hive for a year now), but at the time the repl subsystem was
written, that behaviour was correct, and intentional.
The basic idea is this - Hive has two types of tables: MANAGED, where Hive is
responsible for the storage, and EXTERNAL, where some other external program is
responsible for the storage. A key way to think about this distinction is what
happens when you do a DROP TABLE. For MANAGED tables, if a DROP TABLE is
issued, Hive should delete the data on HDFS, since we own and manage the data
as well. For EXTERNAL tables, we are guests, and some other tool is managing
the data, and thus, we should not touch it - we can drop the metadata, but we
leave the data on HDFS alone.
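
To make the DROP TABLE distinction above concrete, here is a toy model in
Python. It is only a sketch: ToyWarehouse, its two dicts, and the table names
are made up for illustration and are not Hive's metastore API. The one rule it
encodes is the one described above: metadata is always dropped, data only for
MANAGED.

```python
from dataclasses import dataclass, field
from enum import Enum


class TableType(Enum):
    MANAGED = "MANAGED"
    EXTERNAL = "EXTERNAL"


@dataclass
class ToyWarehouse:
    """Hypothetical stand-in for a metastore plus a filesystem (not Hive's API)."""
    metadata: dict = field(default_factory=dict)  # table name -> TableType
    files: dict = field(default_factory=dict)     # table name -> list of data files

    def create_table(self, name, table_type, data_files):
        self.metadata[name] = table_type
        self.files[name] = list(data_files)

    def drop_table(self, name):
        # The metadata entry is removed for both table types.
        table_type = self.metadata.pop(name)
        if table_type is TableType.MANAGED:
            # The warehouse owns the storage, so the data is deleted as well.
            self.files.pop(name, None)
        # EXTERNAL: some other tool owns the data, so the files are left alone.


wh = ToyWarehouse()
wh.create_table("sales_managed", TableType.MANAGED, ["part-0000"])
wh.create_table("sales_external", TableType.EXTERNAL, ["part-0000"])
wh.drop_table("sales_managed")
wh.drop_table("sales_external")
print(sorted(wh.files))  # only the EXTERNAL table's files are still present
```
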
Now, in the case where we're replicating from a primary to a secondary, if the
table is an EXTERNAL table on the primary, then an external tool is managing it
on the primary. But what about the secondary? Since the secondary is being
"managed" by Hive Replication, and thus by Hive, we own and manage it, keeping
it in sync with the primary. Thus, by definition, the copy is MANAGED even if
the source is EXTERNAL. If we kept it EXTERNAL, we would start having some
weird in-between behaviour that we'd have to add complex rules for - consider
the same deletion scenario:
If we have a DROP PARTITION on the source table, by definition, on the source,
we do not delete the data on the source HDFS. The user will likely do an hdfs
rm, refresh the data, and might do an ADD PARTITION of new data. Now, what
about the destination? Should we delete the data corresponding to that DROP
PARTITION on the destination? If so, then that is consistent with the
behaviour for MANAGED, rather than EXTERNAL, and thus, we should keep it as
MANAGED. If not, then we have leftover data sitting in HDFS in the same
location, and if new data gets added in as a result of an upcoming ADD
PARTITION, then the outcome is indeterminate and depends on what the user
did - it can be the correct new data, it can be a partial merge or a weird
append. That gets messy fast.
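
The DROP PARTITION / ADD PARTITION sequence above can be sketched the same
way. This is a toy replay in Python, not how the repl subsystem is
implemented: replay_on_destination, the event tuples, and the partition and
file names are all hypothetical, and the "leave data alone" branch models only
the "weird append" outcome out of the several possible ones.

```python
def replay_on_destination(dest_files, events, delete_on_drop):
    """Apply replicated partition events to a toy destination layout.

    dest_files: dict mapping partition name -> set of file names.
    events: list of (op, partition, files) tuples (hypothetical format).
    delete_on_drop: True models MANAGED semantics on the destination,
                    False models keeping the destination EXTERNAL.
    """
    # Copy the input so the caller's layout is not modified.
    state = {part: set(files) for part, files in dest_files.items()}
    for op, partition, files in events:
        if op == "DROP_PARTITION":
            if delete_on_drop:
                state.pop(partition, None)
            # Otherwise the old files stay where they are.
        elif op == "ADD_PARTITION":
            # New files land in the same location as anything left behind.
            state.setdefault(partition, set()).update(files)
    return state


initial = {"dt=2018-08-01": {"old-0000", "old-0001"}}
events = [
    ("DROP_PARTITION", "dt=2018-08-01", ()),
    ("ADD_PARTITION", "dt=2018-08-01", ("new-0000",)),
]

managed = replay_on_destination(initial, events, delete_on_drop=True)
external = replay_on_destination(initial, events, delete_on_drop=False)
print(managed)   # the partition holds only the new file
print(external)  # the partition holds the old files mixed with the new one
```

With delete_on_drop=True the destination ends up matching the refreshed
source; with False it ends up with stale and new files side by side.
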
So, for this problem and other possible unexpected problems, we decided to be
consistent with the meaning of MANAGED and EXTERNAL, and always make repl
destinations MANAGED. :)
> Enable function metadata to be written during bootstrap
> -------------------------------------------------------
>
> Key: HIVE-16266
> URL: https://issues.apache.org/jira/browse/HIVE-16266
> Project: Hive
> Issue Type: Sub-task
> Components: repl
> Affects Versions: 2.2.0
> Reporter: anishek
> Assignee: anishek
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: HIVE-16266.1.patch, HIVE-16266.2.patch,
> HIVE-16266.3.patch
>
>