[
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on HIVE-21761 started by Sankar Hariappan.
-----------------------------------------------
> Support table level replication in Hive
> ---------------------------------------
>
> Key: HIVE-21761
> URL: https://issues.apache.org/jira/browse/HIVE-21761
> Project: Hive
> Issue Type: New Feature
> Components: repl
> Reporter: Sankar Hariappan
> Assignee: Sankar Hariappan
> Priority: Major
> Labels: DR, Replication
>
> *Requirements:*
> {code}
> - Users need to define a replication policy to replicate specific tables.
> This lets users replicate only the business-critical tables instead of
> replicating all tables, which may strain network bandwidth and storage and
> also slow down Hive replication.
> - Users need to define a replication policy using regular expressions (such
> as db.sales_*), include additional tables that do not match the given
> pattern, and exclude some tables that do match it.
> - Users need to dynamically add/remove tables to the list by manually
> changing the replication policy at run time.
> {code}
> *Design:*
> {code}
> 1. Hive continues to support the DB-level replication policy of format
> <db_name>.* but, logically, we support the policy as <db_name>.(t1, t3, …).
> 2. A regular expression can also be supported as the replication policy. For
> example,
> a. <db_name>.<prefix*>,
> b. <db_name>.<*suffix>,
> c. <db_name>.<prefix*suffix>.
> 3. If a regular expression is provided as the replication policy, then Hive
> also accepts include and exclude lists as input, which also helps to
> dynamically add/remove tables for replication.
> a. The exclude list specifies the tables to be excluded even if they satisfy
> the regular expression.
> b. The include list specifies the tables to be included in addition to the
> tables satisfying the regular expression.
> 4. The new format for the replication policy has 3 parts, all separated by a
> dot (.).
> a. The first part is the DB name.
> b. The second part is the included list: comma-separated table names/regex
> within square brackets [].
> c. The third part is the excluded list: comma-separated table names/regex
> within square brackets [].
> - <db_name> -- Full DB replication
> - <db_name>.* -- Full DB replication
> - <db_name>.[t1, t3] -- DB replication with static list of tables t1 and
> t3 included.
> - <db_name>.[t1*, t2].[t100] -- DB replication with all tables having
> prefix t1, plus table t2, which doesn’t have prefix t1, and excluding t100,
> which has the prefix t1.
> 5. If the DB property “repl.source.for” is set, then by default all the
> tables in the DB will be enabled for replication and will continue to archive
> deleted data to the CM path.
> 6. REPL DUMP takes 2 inputs along with the existing FROM and WITH clauses.
> a. REPL DUMP <current_repl_policy> [REPLACE <previous_repl_policy>] FROM
> <last_repl_id> WITH <key_values_list>;
> current_repl_policy and previous_repl_policy can be in any format mentioned
> in Point-4.
> b. The REPLACE clause is to be supported to take the previous repl policy as
> input.
> c. The rest of the format remains the same.
> 7. Now, REPL DUMP on this DB will replicate the tables based on
> current_repl_policy.
> 8. Any table that is added dynamically, either due to a change in the
> regular expression or by being added to the include list, should be
> bootstrapped.
> a. Hive will automatically figure out the list of tables newly included in
> the list by comparing the current_repl_policy & previous_repl_policy inputs
> and combine the bootstrap dump for the added tables as part of the incremental
> dump. As we can combine the first incremental with the bootstrap dump, it
> removes the current limitation of the target DB being inconsistent after
> bootstrap unless we run the first incremental replication.
> b. If any table is renamed, then it may get dynamically added/removed for
> replication based on the defined replication policy + include/exclude list.
> So, Hive will perform bootstrap for the table which is just included after
> rename.
> c. Also, if the renamed table is excluded from the replication policy, then
> the old table needs to be dropped at the target as well.
> 9. Only the initial bootstrap load expects the target DB to be empty, but the
> intermediate bootstrap on tables due to regex or inclusion/exclusion list
> changes or renames doesn’t expect the target DB or table to be empty. If any
> table with the same name exists during such a bootstrap, the table will be
> overwritten, including its data.
> {code}
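
To make the policy format in Point-4 and the policy comparison in Point-8 concrete, here is a rough Python sketch. It is not Hive code: the function names (parse_policy, is_replicated, tables_to_bootstrap) are hypothetical, "*" is assumed to be the only wildcard, and the exclude list is assumed to take precedence over the include list, which the design above does not state explicitly.

```python
import re

# Hypothetical helpers for illustration only; none of these are Hive APIs.

def parse_policy(policy):
    """Split '<db>', '<db>.*', '<db>.[inc]' or '<db>.[inc].[exc]' into parts."""
    m = re.fullmatch(
        r"(\w+)(?:\.(\*|\[[^\]]*\]))?(?:\.\[([^\]]*)\])?", policy.strip()
    )
    if m is None:
        raise ValueError(f"unrecognised policy: {policy!r}")
    db, include, exclude = m.groups()

    def items(text):
        return [t.strip() for t in text.split(",") if t.strip()]

    # '<db>' and '<db>.*' both mean full DB replication.
    if include is None or include == "*":
        include_list = ["*"]
    else:
        include_list = items(include[1:-1])  # strip the square brackets
    return db, include_list, items(exclude or "")


def _matches(pattern, table):
    # Assumption: '*' stands for any run of characters; the rest is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, table) is not None


def is_replicated(policy, table):
    """True if the table is selected by the policy (exclude list wins)."""
    _, include, exclude = parse_policy(policy)
    if any(_matches(p, table) for p in exclude):
        return False
    return any(_matches(p, table) for p in include)


def tables_to_bootstrap(current_policy, previous_policy, tables):
    """Tables selected by the current policy but not by the previous one."""
    return sorted(
        t for t in tables
        if is_replicated(current_policy, t)
        and not is_replicated(previous_policy, t)
    )


if __name__ == "__main__":
    existing = ["t1", "t15", "t2", "t100", "t3"]
    print(tables_to_bootstrap("sales.[t1*, t2].[t100]", "sales.[t1*]", existing))
    # -> ['t2']
```

In this example only t2 is newly included, so it is the only table that would need a bootstrap dump alongside the incremental dump. t100 was covered by the previous policy but is excluded by the current one, so it would fall out of replication.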