[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-08-18 Thread Sushanth Sowmyan (JIRA)

 [ https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HIVE-7341:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed. Thanks, Mithun!

(@Lefty: This patch doesn't need much end-user documentation, but it may 
need some programmer documentation, which should mostly be covered by the 
javadocs and this bug report.)

 Support for Table replication across HCatalog instances
 ---

 Key: HIVE-7341
 URL: https://issues.apache.org/jira/browse/HIVE-7341
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Affects Versions: 0.13.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 0.14.0

 Attachments: HIVE-7341.1.patch, HIVE-7341.2.patch, HIVE-7341.3.patch, 
 HIVE-7341.4.patch, HIVE-7341.5.patch


 The HCatClient currently provides little support for replicating 
 HCatTable definitions between 2 HCatalog Server (i.e. Hive metastore) 
 instances. 
 Systems similar to Apache Falcon might need to replicate partition 
 data between 2 clusters, and keep the HCatalog metadata in sync between the 
 two. This poses a few problems:
 # The definition of the source table might change (in column schema, I/O 
 formats, record-formats, serde-parameters, etc.). The system will need a way 
 to diff 2 tables and update the target-metastore with the changes. E.g. 
 {code}
 targetTable.resolve( sourceTable, targetTable.diff(sourceTable) );
 hcatClient.updateTableSchema(dbName, tableName, targetTable);
 {code}
 # The current {{HCatClient.addPartitions()}} API requires that the 
 partition's schema be derived from the table's schema, thereby requiring that 
 the table-schema be resolved *before* partitions with the new schema are 
 added to the table. This is problematic, because it introduces race 
 conditions when 2 partitions with differing column-schemas (e.g. right after 
 a schema change) are copied in parallel. This can be avoided if each 
 HCatAddPartitionDesc keeps track of the partition's schema, in flight.
 # The source and target metastores might be running different/incompatible 
 versions of Hive. 
 The impending patch attempts to address these concerns (with some caveats).
 # {{HCatTable}} now has 
 ## a {{diff()}} method, to compare against another HCatTable instance
 ## a {{resolve(diff)}} method to copy over specified table-attributes from 
 another HCatTable
 ## a serialize/deserialize mechanism (via {{HCatClient.serializeTable()}} and 
 {{HCatClient.deserializeTable()}}), so that HCatTable instances constructed 
 in other class-loaders may be used for comparison
 # {{HCatPartition}} now provides finer-grained control over a Partition's 
 column-schema, StorageDescriptor settings, etc. This allows partitions to be 
 copied completely from source, with the ability to override specific 
 properties if required (e.g. location).
 # {{HCatClient.updateTableSchema()}} can now update the entire 
 table-definition, not just the column schema.
 # I've cleaned up and removed most of the redundancy between HCatTable, 
 HCatCreateTableDesc and HCatCreateTableDesc.Builder. The prior API failed to 
 separate the table-attributes from the add-table-operation's attributes. By 
 providing fluent-interfaces in HCatTable, and composing an HCatTable instance 
 in HCatCreateTableDesc, the interfaces are cleaner(ish). The old setters are 
 deprecated, in favour of those in HCatTable. The same applies to 
 HCatPartition and HCatAddPartitionDesc.
 I'll post a patch for trunk shortly.
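
 To illustrate the diff/resolve idea described above, here is a minimal, 
 self-contained sketch. It does not use the real HCatTable or HCatClient 
 classes: {{SimpleTable}}, {{Attribute}} and every field name below are 
 hypothetical stand-ins, chosen only to show how a diff could yield a set of 
 differing attributes and how resolve could copy just those attributes from 
 the source definition onto the target.

```java
import java.util.EnumSet;
import java.util.Objects;

/** Illustrative only: a hypothetical stand-in, NOT the real HCatTable API. */
final class TableDiffSketch {

    /** Table attributes that may differ between two definitions (assumed names). */
    enum Attribute { COLUMNS, INPUT_FORMAT, OUTPUT_FORMAT, SERDE }

    /** A deliberately tiny table definition holding only string-valued attributes. */
    static final class SimpleTable {
        String columns;
        String inputFormat;
        String outputFormat;
        String serde;

        SimpleTable(String columns, String inputFormat, String outputFormat, String serde) {
            this.columns = columns;
            this.inputFormat = inputFormat;
            this.outputFormat = outputFormat;
            this.serde = serde;
        }

        /** Returns the set of attributes whose values differ from {@code other}. */
        EnumSet<Attribute> diff(SimpleTable other) {
            EnumSet<Attribute> changed = EnumSet.noneOf(Attribute.class);
            if (!Objects.equals(columns, other.columns)) changed.add(Attribute.COLUMNS);
            if (!Objects.equals(inputFormat, other.inputFormat)) changed.add(Attribute.INPUT_FORMAT);
            if (!Objects.equals(outputFormat, other.outputFormat)) changed.add(Attribute.OUTPUT_FORMAT);
            if (!Objects.equals(serde, other.serde)) changed.add(Attribute.SERDE);
            return changed;
        }

        /** Copies only the listed attributes from {@code other} onto this table. */
        SimpleTable resolve(SimpleTable other, EnumSet<Attribute> attributes) {
            for (Attribute attribute : attributes) {
                switch (attribute) {
                    case COLUMNS: columns = other.columns; break;
                    case INPUT_FORMAT: inputFormat = other.inputFormat; break;
                    case OUTPUT_FORMAT: outputFormat = other.outputFormat; break;
                    case SERDE: serde = other.serde; break;
                }
            }
            return this;
        }
    }

    public static void main(String[] args) {
        // The source table has gained a column; the target still has the old schema.
        SimpleTable source = new SimpleTable("id int, name string, ts bigint", "in", "out", "serde");
        SimpleTable target = new SimpleTable("id int, name string", "in", "out", "serde");

        EnumSet<Attribute> changes = target.diff(source);
        System.out.println("Differing attributes: " + changes);

        // Bring the target in line with the source, attribute by attribute.
        target.resolve(source, changes);
        System.out.println("Remaining differences: " + target.diff(source));
    }
}
```

 Passing the attribute set explicitly lets a caller choose to resolve only a 
 subset of the differences, e.g. syncing the column schema while keeping a 
 target-specific setting untouched.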



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-08-13 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Status: Open  (was: Patch Available)

Thanks for the review, Sush. :] I'll post an updated patch with the log-message 
modified.



[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-08-13 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Attachment: HIVE-7341.5.patch

Updated patch, with a better log-message (about Tables with StorageHandlers 
specified.)



[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-08-13 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Status: Patch Available  (was: Open)



[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-08-11 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Status: Open  (was: Patch Available)

Need to rebase. (Plus, the file-format needs to be specified now as orcfile 
instead of orc).



[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-08-11 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Attachment: HIVE-7341.4.patch

Rebased. 



[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-08-11 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Status: Patch Available  (was: Open)



[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-08-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Attachment: HIVE-7341.3.patch

Added documentation for MetadataSerializer, and subclass.



[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-08-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Status: Open  (was: Patch Available)

 Support for Table replication across HCatalog instances
 ---

 Key: HIVE-7341
 URL: https://issues.apache.org/jira/browse/HIVE-7341
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Affects Versions: 0.13.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 0.14.0

 Attachments: HIVE-7341.1.patch, HIVE-7341.2.patch, HIVE-7341.3.patch


 The HCatClient currently doesn't provide very much support for replicating 
 HCatTable definitions between 2 HCatalog Server (i.e. Hive metastore) 
 instances. 
 Systems similar to Apache Falcon might find the need to replicate partition 
 data between 2 clusters, and keep the HCatalog metadata in sync between the 
 two. This poses a couple of problems:
 # The definition of the source table might change (in column schema, I/O 
 formats, record-formats, serde-parameters, etc.) The system will need a way 
 to diff 2 tables and update the target-metastore with the changes. E.g. 
 {code}
 targetTable.resolve( sourceTable, targetTable.diff(sourceTable) );
 hcatClient.updateTableSchema(dbName, tableName, targetTable);
 {code}
 # The current {{HCatClient.addPartitions()}} API requires that the 
 partition's schema be derived from the table's schema, thereby requiring that 
 the table-schema be resolved *before* partitions with the new schema are 
 added to the table. This is problematic, because it introduces race 
 conditions when 2 partitions with differing column-schemas (e.g. right after 
 a schema change) are copied in parallel. This can be avoided if each 
 HCatAddPartitionDesc kept track of the partition's schema, in flight.
 # The source and target metastores might be running different/incompatible 
 versions of Hive. 
 The impending patch attempts to address these concerns (with some caveats).
 # {{HCatTable}} now has 
 ## a {{diff()}} method, to compare against another HCatTable instance
 ## a {{resolve(diff)}} method to copy over specified table-attributes from 
 another HCatTable
 ## a serialize/deserialize mechanism (via {{HCatClient.serializeTable()}} and 
 {{HCatClient.deserializeTable()}}), so that HCatTable instances constructed 
 in other class-loaders may be used for comparison
 # {{HCatPartition}} now provides finer-grained control over a Partition's 
 column-schema, StorageDescriptor settings, etc. This allows partitions to be 
 copied completely from source, with the ability to override specific 
 properties if required (e.g. location).
 # {{HCatClient.updateTableSchema()}} can now update the entire 
 table-definition, not just the column schema.
 # I've cleaned up and removed most of the redundancy between HCatTable, 
 HCatCreateTableDesc and HCatCreateTableDesc.Builder. The prior API failed to 
 separate the table's attributes from the attributes of the add-table 
 operation. HCatTable now provides fluent interfaces, and HCatCreateTableDesc 
 composes an HCatTable instance, so the interfaces are cleaner(ish). The old 
 setters are deprecated in favour of those in HCatTable. The same applies to 
 HCatPartition and HCatAddPartitionDesc.
 I'll post a patch for trunk shortly.
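
The diff-then-resolve idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: {{TableDef}} and its {{Attribute}} enum are hypothetical stand-ins, not the real {{HCatTable}} class, and the four attributes are an arbitrary subset chosen for the example.

```java
import java.util.EnumSet;
import java.util.Objects;

// Hypothetical stand-in for a table definition; NOT the real HCatTable class.
class TableDef {
    // Illustrative subset of attributes that can drift between metastores.
    enum Attribute { COLUMNS, INPUT_FORMAT, OUTPUT_FORMAT, SERDE }

    private String columns;
    private String inputFormat;
    private String outputFormat;
    private String serde;

    TableDef(String columns, String inputFormat, String outputFormat, String serde) {
        this.columns = columns;
        this.inputFormat = inputFormat;
        this.outputFormat = outputFormat;
        this.serde = serde;
    }

    String get(Attribute attribute) {
        switch (attribute) {
            case COLUMNS: return columns;
            case INPUT_FORMAT: return inputFormat;
            case OUTPUT_FORMAT: return outputFormat;
            case SERDE: return serde;
            default: throw new IllegalArgumentException("Unknown attribute: " + attribute);
        }
    }

    private void set(Attribute attribute, String value) {
        switch (attribute) {
            case COLUMNS: columns = value; break;
            case INPUT_FORMAT: inputFormat = value; break;
            case OUTPUT_FORMAT: outputFormat = value; break;
            case SERDE: serde = value; break;
            default: throw new IllegalArgumentException("Unknown attribute: " + attribute);
        }
    }

    // Returns the attributes whose values differ between this table and rhs.
    EnumSet<Attribute> diff(TableDef rhs) {
        EnumSet<Attribute> changed = EnumSet.noneOf(Attribute.class);
        for (Attribute attribute : Attribute.values()) {
            if (!Objects.equals(get(attribute), rhs.get(attribute))) {
                changed.add(attribute);
            }
        }
        return changed;
    }

    // Copies only the named attributes from rhs into this table.
    TableDef resolve(TableDef rhs, EnumSet<Attribute> attributes) {
        for (Attribute attribute : attributes) {
            set(attribute, rhs.get(attribute));
        }
        return this;
    }
}

class DiffResolveSketch {
    public static void main(String[] args) {
        // The source gained a column; everything else is unchanged.
        TableDef source = new TableDef(
                "id int, name string, ts bigint", "rcfile-in", "rcfile-out", "columnar");
        TableDef target = new TableDef(
                "id int, name string", "rcfile-in", "rcfile-out", "columnar");

        EnumSet<TableDef.Attribute> changed = target.diff(source);
        System.out.println("Changed attributes: " + changed);

        target.resolve(source, changed);
        System.out.println("In sync after resolve: " + target.diff(source).isEmpty());
    }
}
```

In the snippet from the description, the same two steps are followed by {{hcatClient.updateTableSchema()}}, which pushes the resolved definition to the target metastore.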



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-08-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Status: Patch Available  (was: Open)

 Support for Table replication across HCatalog instances
 ---

 Key: HIVE-7341
 URL: https://issues.apache.org/jira/browse/HIVE-7341
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Affects Versions: 0.13.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 0.14.0

 Attachments: HIVE-7341.1.patch, HIVE-7341.2.patch, HIVE-7341.3.patch







[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-07-30 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Attachment: HIVE-7341.2.patch

Improved patch, to ensure deprecated APIs still function.

{{HCatAddPartitionDesc.create(db, table, location, partKeyValMap)}} no longer 
throws an UnsupportedException.

 Support for Table replication across HCatalog instances
 ---

 Key: HIVE-7341
 URL: https://issues.apache.org/jira/browse/HIVE-7341
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Affects Versions: 0.13.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 0.14.0

 Attachments: HIVE-7341.1.patch, HIVE-7341.2.patch







[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-07-30 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Status: Patch Available  (was: Open)

 Support for Table replication across HCatalog instances
 ---

 Key: HIVE-7341
 URL: https://issues.apache.org/jira/browse/HIVE-7341
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Affects Versions: 0.13.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 0.14.0

 Attachments: HIVE-7341.1.patch, HIVE-7341.2.patch







[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-07-07 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Attachment: (was: HIVE-7341.1.patch)

 Support for Table replication across HCatalog instances
 ---

 Key: HIVE-7341
 URL: https://issues.apache.org/jira/browse/HIVE-7341
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Affects Versions: 0.13.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 0.14.0







[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-07-07 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Attachment: HIVE-7341.1.patch

Updated patch with the missing class.

 Support for Table replication across HCatalog instances
 ---

 Key: HIVE-7341
 URL: https://issues.apache.org/jira/browse/HIVE-7341
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Affects Versions: 0.13.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 0.14.0

 Attachments: HIVE-7341.1.patch







[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-07-02 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Attachment: HIVE-7341.1.patch

The tentative first version of the fix.

 Support for Table replication across HCatalog instances
 ---

 Key: HIVE-7341
 URL: https://issues.apache.org/jira/browse/HIVE-7341
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Affects Versions: 0.13.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 0.14.0

 Attachments: HIVE-7341.1.patch







[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

2014-07-02 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7341:
---

Description: 
The HCatClient currently doesn't provide very much support for replicating 
HCatTable definitions between 2 HCatalog Server (i.e. Hive metastore) 
instances. 

Systems similar to Apache Falcon might find the need to replicate partition 
data between 2 clusters, and keep the HCatalog metadata in sync between the 
two. This poses a couple of problems:

# The definition of the source table might change (in column schema, I/O 
formats, record-formats, serde-parameters, etc.) The system will need a way to 
diff 2 tables and update the target-metastore with the changes. E.g. 
{code}
targetTable.resolve( sourceTable, targetTable.diff(sourceTable) );
hcatClient.updateTableSchema(dbName, tableName, targetTable);
{code}
# The current {{HCatClient.addPartitions()}} API requires that the partition's 
schema be derived from the table's schema, thereby requiring that the 
table-schema be resolved *before* partitions with the new schema are added to 
the table. This is problematic, because it introduces race conditions when 2 
partitions with differing column-schemas (e.g. right after a schema change) are 
copied in parallel. This can be avoided if each HCatAddPartitionDesc kept track 
of the partition's schema, in flight.
# The source and target metastores might be running different/incompatible 
versions of Hive. 

The impending patch attempts to address these concerns (with some caveats).

# {{HCatTable}} now has 
## a {{diff()}} method, to compare against another HCatTable instance
## a {{resolve(diff)}} method to copy over specified table-attributes from 
another HCatTable
## a serialize/deserialize mechanism (via {{HCatClient.serializeTable()}} and 
{{HCatClient.deserializeTable()}}), so that HCatTable instances constructed in 
other class-loaders may be used for comparison
# {{HCatPartition}} now provides finer-grained control over a Partition's 
column-schema, StorageDescriptor settings, etc. This allows partitions to be 
copied completely from source, with the ability to override specific properties 
if required (e.g. location).
# {{HCatClient.updateTableSchema()}} can now update the entire 
table-definition, not just the column schema.
# I've cleaned up and removed most of the redundancy between the HCatTable, 
HCatCreateTableDesc and HCatCreateTableDesc.Builder. The prior API failed to 
separate the table-attributes from the add-table-operation's attributes. By 
providing fluent-interfaces in HCatTable, and composing an HCatTable instance 
in HCatCreateTableDesc, the interfaces are cleaner(ish). The old setters are 
deprecated, in favour of those in HCatTable. Likewise, HCatPartition and 
HCatAddPartitionDesc.

I'll post a patch for trunk shortly.

  was:
The HCatClient currently doesn't provide very much support for replicating 
HCatTable definitions between 2 HCatalog Server (i.e. Hive metastore) 
instances. 

Systems similar to Apache Falcon might find the need to replicate partition 
data between 2 clusters, and keep the HCatalog metadata in sync between the 
two. This poses a couple of problems:

# The definition of the source table might change (in column schema, I/O 
formats, record-formats, serde-parameters, etc.) The system will need a way to 
diff 2 tables and update the target-metastore with the changes. E.g. 
{code}
targetTable.resolve( sourceTable, targetTable.diff(sourceTable) );
hcatClient.updateTableSchema(dbName, tableName, targetTable);
{code}
# The current {HCatClient.addPartitions()} API requires that the partition's 
schema be derived from the table's schema, thereby requiring that the 
table-schema be resolved *before* partitions with the new schema are added to 
the table. This is problematic, because it introduces race conditions when 2 
partitions with differing column-schemas (e.g. right after a schema change) are 
copied in parallel. This can be avoided if each HCatAddPartitionDesc kept track 
of the partition's schema, in flight.
# The source and target metastores might be running different/incompatible 
versions of Hive. 

The impending patch attempts to address these concerns (with some caveats).

# {{HCatTable}} now has 
## a {{diff()}} method, to compare against another HCatTable instance
## a {{resolve(diff)}} method to copy over specified table-attributes from 
another HCatTable
## a serialize/deserialize mechanism (via {{HCatClient.serializeTable()}} and 
{{HCatClient.deserializeTable()}}), so that HCatTable instances constructed in 
other class-loaders may be used for comparison
# {{HCatPartition}} now provides finer-grained control over a Partition's 
column-schema, StorageDescriptor settings, etc. This allows partitions to be 
copied completely from source, with the ability to override specific properties 
if required (e.g. location).
#