[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-11-01 Thread Lefty Leverenz (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lefty Leverenz updated HIVE-7223:
-
Labels: TODOC14  (was: )

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
  Labels: TODOC14
 Fix For: 0.14.0

 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch, 
 HIVE-7223.4.patch, HIVE-7223.5.patch


 Currently, the functions in the HiveMetaStore API that handle multiple 
 partitions do so using List<Partition>. E.g. 
 {code}
 public List<Partition> listPartitions(String db_name, String tbl_name, short max_parts);
 public List<Partition> listPartitionsByFilter(String db_name, String tbl_name, String filter, short max_parts);
 public int add_partitions(List<Partition> new_parts);
 {code}
 Partition objects are fairly heavyweight, since each Partition carries its 
 own copy of a StorageDescriptor, partition-values, etc. For tables with tens 
 of thousands of partitions, listing the partitions takes so long that the 
 client times out with the default hive.metastore.client.socket.timeout. There 
 is the additional expense of serializing and deserializing metadata for large 
 sets of partitions, in both time and heap-space. Reducing the Thrift traffic 
 should help in this regard.
 In a date-partitioned table, all sub-partitions for a particular date are 
 *likely* (but not expected) to have:
 # The same base directory (e.g. {{/feeds/search/20140601/}})
 # Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}})
 # The same SerDe/StorageHandler/IOFormat classes
 # The same Sorting/Bucketing/SkewInfo settings
 In this “most likely” scenario (henceforth termed “normal”), it’s possible to 
 represent the partition-list (for a date) in a more condensed form: a list of 
 LighterPartition instances, all sharing a common StorageDescriptor whose 
 location points to the root directory. 
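
The condensed form described above could look roughly like the following sketch. All of the types here ({{StorageDescriptor}}, {{LighterPartition}}, {{SharedSdPartitionGroup}}) are simplified, hypothetical stand-ins written for illustration, not the real metastore classes; the point is only that per-partition state shrinks to the partition-values and a path relative to the shared root.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Demo entry point; every type below is a hypothetical stand-in, not a Hive class.
class SharedSdDemo {
    public static void main(String[] args) {
        StorageDescriptor sd =
            new StorageDescriptor("/feeds/search/20140601", "example.SerDe");
        SharedSdPartitionGroup group = new SharedSdPartitionGroup(sd);
        for (String country : Arrays.asList("US", "UK", "IN")) {
            group.add(Arrays.asList("20140601", country), country);
        }
        // One StorageDescriptor is held for the group, however many partitions it has.
        for (LighterPartition p : group.partitions) {
            System.out.println(p.values + " -> " + group.locationOf(p));
        }
    }
}

// Settings that the "normal" case shares across all sub-partitions of a date.
final class StorageDescriptor {
    final String location;    // root directory of the group
    final String serdeClass;  // placeholder for SerDe/IOFormat settings

    StorageDescriptor(String location, String serdeClass) {
        this.location = location;
        this.serdeClass = serdeClass;
    }
}

// Holds only what differs from one partition to the next.
final class LighterPartition {
    final List<String> values;  // partition-values, e.g. [20140601, US]
    final String relativePath;  // path below the shared root, e.g. "US"

    LighterPartition(List<String> values, String relativePath) {
        this.values = values;
        this.relativePath = relativePath;
    }
}

// A partition-list in condensed form: many light partitions, one descriptor.
final class SharedSdPartitionGroup {
    final StorageDescriptor sharedSd;
    final List<LighterPartition> partitions = new ArrayList<>();

    SharedSdPartitionGroup(StorageDescriptor sharedSd) {
        this.sharedSd = sharedSd;
    }

    void add(List<String> values, String relativePath) {
        partitions.add(new LighterPartition(values, relativePath));
    }

    // Rebuilds a partition's full location from the shared root.
    String locationOf(LighterPartition p) {
        String root = sharedSd.location;
        return (root.endsWith("/") ? root : root + "/") + p.relativePath;
    }
}
```

The full location is recomputed on demand rather than stored, which is where the saving over one StorageDescriptor per Partition would come from.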
 We can go one better for the {{add_partitions()}} case: When adding all 
 partitions for a given date, the “normal” case affords us the ability to 
 specify the top-level date-directory, where sub-partitions can be inferred 
 from the HDFS directory-path.
 These extensions are hard to introduce at the metastore-level, since 
 partition-functions explicitly specify {{List<Partition>}} arguments. I 
 wonder if a {{PartitionSpec}} interface might help:
 {code}
 public PartitionSpec listPartitions(db_name, tbl_name, max_parts) throws ... ;
 public int add_partitions( PartitionSpec new_parts ) throws … ;
 {code}
 where the PartitionSpec looks like:
 {code}
 public interface PartitionSpec {
   public List<Partition> getPartitions();
   public List<String> getPartNames();
   public Iterator<Partition> getPartitionIter();
   public Iterator<String> getPartNameIter();
 }
 {code}
 For addPartitions(), an {{HDFSDirBasedPartitionSpec}} class could implement 
 {{PartitionSpec}}, store a top-level directory, and return Partition 
 instances from sub-directory names, while storing a single StorageDescriptor 
 for all of them.
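
A minimal sketch of the directory-based idea follows, under some loud assumptions: the sub-directory names are passed in as a plain list (the actual HDFS listing is left out), {{Partition}} and {{StorageDescriptor}} are cut-down hypothetical stand-ins, and the part-name format ({{date/subdir}}) is purely illustrative. The class is named {{DirBasedPartitionSpec}} here to make clear it is not the proposed {{HDFSDirBasedPartitionSpec}} itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Demo entry point; all types below are hypothetical stand-ins, not Hive classes.
class DirBasedSpecDemo {
    public static void main(String[] args) {
        StorageDescriptor sd = new StorageDescriptor("/feeds/search/20140601");
        PartitionSpec spec =
            new DirBasedPartitionSpec(sd, "20140601", Arrays.asList("US", "UK", "IN"));
        Iterator<Partition> it = spec.getPartitionIter();
        while (it.hasNext()) {
            Partition p = it.next();
            System.out.println(p.values + " -> " + p.location);
        }
    }
}

final class StorageDescriptor {
    final String location;  // top-level date-directory
    StorageDescriptor(String location) { this.location = location; }
}

final class Partition {
    final List<String> values;
    final String location;
    final StorageDescriptor sd;  // a reference to the shared descriptor, not a copy

    Partition(List<String> values, String location, StorageDescriptor sd) {
        this.values = values;
        this.location = location;
        this.sd = sd;
    }
}

interface PartitionSpec {
    List<Partition> getPartitions();
    List<String> getPartNames();
    Iterator<Partition> getPartitionIter();
    Iterator<String> getPartNameIter();
}

// Stores one directory and one descriptor; Partitions are derived from sub-directory names.
final class DirBasedPartitionSpec implements PartitionSpec {
    private final StorageDescriptor sharedSd;
    private final String dateValue;
    private final List<String> subDirs;  // assumed to come from a directory listing

    DirBasedPartitionSpec(StorageDescriptor sharedSd, String dateValue, List<String> subDirs) {
        this.sharedSd = sharedSd;
        this.dateValue = dateValue;
        this.subDirs = new ArrayList<>(subDirs);
    }

    @Override
    public Iterator<Partition> getPartitionIter() {
        final Iterator<String> dirs = subDirs.iterator();
        return new Iterator<Partition>() {
            @Override public boolean hasNext() { return dirs.hasNext(); }
            @Override public Partition next() {
                // Each Partition is built only when the caller asks for it.
                String dir = dirs.next();
                return new Partition(Arrays.asList(dateValue, dir),
                                     sharedSd.location + "/" + dir, sharedSd);
            }
        };
    }

    @Override
    public List<Partition> getPartitions() {
        List<Partition> all = new ArrayList<>();
        Iterator<Partition> it = getPartitionIter();
        while (it.hasNext()) {
            all.add(it.next());
        }
        return all;
    }

    @Override
    public List<String> getPartNames() {
        List<String> names = new ArrayList<>();
        for (String dir : subDirs) {
            names.add(dateValue + "/" + dir);  // illustrative name format only
        }
        return names;
    }

    @Override
    public Iterator<String> getPartNameIter() {
        return getPartNames().iterator();
    }
}
```

Callers that only iterate never hold more than one derived Partition at a time; {{getPartitions()}} remains available for code that needs the whole list.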
 Similarly, list_partitions() could return a List<PartitionSpec>, where each 
 PartitionSpec corresponds to a set of partitions that can share a 
 StorageDescriptor.
 By exposing iterator semantics, neither the client nor the metastore need 
 instantiate all partitions at once. That should help with memory requirements.
 In case no smart grouping is possible, we could just fall back on a 
 {{DefaultPartitionSpec}} which composes {{List<Partition>}}, and is no worse 
 than the status quo.
 PartitionSpec abstracts away how a set of partitions may be represented. A 
 tighter representation allows us to communicate metadata for a larger number 
 of Partitions, with less Thrift traffic.
 Given that Thrift doesn’t support polymorphism, we’d have to implement the 
 PartitionSpec as a Thrift Union of supported implementations. (We could 
 convert from the Thrift PartitionSpec to the appropriate Java PartitionSpec 
 sub-class.)
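
The union-to-subclass conversion could be sketched as below. This is only an illustration under stated assumptions: {{WireSpec}} is a hand-written stand-in for whatever class the Thrift compiler would generate for the union, the {{PartitionSpec}} interface is reduced to a single {{getLocations()}} method to keep the example short, and {{SharedRootPartitionSpec}} and {{PartitionSpecs.fromWire()}} are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Demo entry point; all names below are hypothetical, not generated Thrift code.
class UnionConversionDemo {
    public static void main(String[] args) {
        WireSpec wire = WireSpec.sharedRoot("/feeds/search/20140601", Arrays.asList("US", "UK"));
        PartitionSpec spec = PartitionSpecs.fromWire(wire);
        System.out.println(spec.getClass().getSimpleName() + ": " + spec.getLocations());
    }
}

// Stand-in for a Thrift union: exactly one of the two variants is populated.
final class WireSpec {
    final List<String> fullLocations;  // "plain list" variant
    final String rootDir;              // "shared root" variant
    final List<String> subDirs;        // "shared root" variant

    private WireSpec(List<String> fullLocations, String rootDir, List<String> subDirs) {
        this.fullLocations = fullLocations;
        this.rootDir = rootDir;
        this.subDirs = subDirs;
    }

    static WireSpec plainList(List<String> locations) {
        return new WireSpec(locations, null, null);
    }

    static WireSpec sharedRoot(String rootDir, List<String> subDirs) {
        return new WireSpec(null, rootDir, subDirs);
    }

    boolean isPlainList() {
        return fullLocations != null;
    }
}

// Reduced to one method for the sketch; the proposed interface has four.
interface PartitionSpec {
    List<String> getLocations();
}

// Fallback: composes a full list and is no better or worse than today.
final class DefaultPartitionSpec implements PartitionSpec {
    private final List<String> locations;

    DefaultPartitionSpec(List<String> locations) {
        this.locations = new ArrayList<>(locations);
    }

    @Override
    public List<String> getLocations() {
        return new ArrayList<>(locations);
    }
}

// Tighter form: one root directory plus relative sub-directory names.
final class SharedRootPartitionSpec implements PartitionSpec {
    private final String rootDir;
    private final List<String> subDirs;

    SharedRootPartitionSpec(String rootDir, List<String> subDirs) {
        this.rootDir = rootDir;
        this.subDirs = new ArrayList<>(subDirs);
    }

    @Override
    public List<String> getLocations() {
        List<String> out = new ArrayList<>();
        for (String dir : subDirs) {
            out.add(rootDir + "/" + dir);
        }
        return out;
    }
}

// Picks the Java sub-class that matches whichever union variant is set.
final class PartitionSpecs {
    static PartitionSpec fromWire(WireSpec wire) {
        if (wire.isPlainList()) {
            return new DefaultPartitionSpec(wire.fullLocations);
        }
        return new SharedRootPartitionSpec(wire.rootDir, wire.subDirs);
    }
}
```

Keeping the dispatch in one factory method means client code works against the Java {{PartitionSpec}} interface and never has to inspect the wire-level union directly.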
 Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-09-07 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated HIVE-7223:
-
Fix Version/s: 0.14.0

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 0.14.0

 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch, 
 HIVE-7223.4.patch, HIVE-7223.5.patch




[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-09-05 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated HIVE-7223:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Mithun.

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch, 
 HIVE-7223.4.patch, HIVE-7223.5.patch




[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-09-03 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---
Attachment: HIVE-7223.5.patch

I've taken your advice a step further, and removed redundancy in 
{{HiveMetaStore.initializeAddedPartition()}}, 
{{MetaStoreUtils.updatePartitionStatsFast()}}, and 
{{Warehouse.getFileStatusesForSD()}}. Cleaner, all around.

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch, 
 HIVE-7223.4.patch, HIVE-7223.5.patch




[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-09-03 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---
Status: Patch Available  (was: Open)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.13.0, 0.12.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch, 
 HIVE-7223.4.patch, HIVE-7223.5.patch




[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-09-03 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---
Status: Open  (was: Patch Available)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.13.0, 0.12.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch, 
 HIVE-7223.4.patch, HIVE-7223.5.patch




[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-09-02 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---
Attachment: HIVE-7223.4.patch

[~gates] Point taken on the repeated code in initializeAddedPartition(). I've 
introduced a wrapper, as you've advised. That cleaned things up.

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch, 
 HIVE-7223.4.patch




[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-09-02 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---
Status: Patch Available  (was: Open)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.13.0, 0.12.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch, 
 HIVE-7223.4.patch


 Currently, the functions in the HiveMetaStore API that handle multiple 
 partitions do so using List<Partition>. E.g. 
 {code}
 public List<Partition> listPartitions(String db_name, String tbl_name, short max_parts);
 public List<Partition> listPartitionsByFilter(String db_name, String tbl_name, String filter, short max_parts);
 public int add_partitions(List<Partition> new_parts);
 {code}
 Partition objects are fairly heavyweight, since each Partition carries its 
 own copy of a StorageDescriptor, partition-values, etc. Tables with tens of 
 thousands of partitions take so long to have their partitions listed that the 
 client times out with the default hive.metastore.client.socket.timeout. There is 
 the additional expense of serializing and deserializing metadata for large 
 sets of partitions, in both time and heap space. Reducing the Thrift traffic 
 should help in this regard.
 In a date-partitioned table, all sub-partitions for a particular date are 
 *likely* (but not expected) to have:
 # The same base directory (e.g. {{/feeds/search/20140601/}})
 # Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}})
 # The same SerDe/StorageHandler/IOFormat classes
 # Sorting/Bucketing/SkewInfo settings
 In this “most likely” scenario (henceforth termed “normal”), it’s possible to 
 represent the partition-list (for a date) in a more condensed form: a list of 
 LighterPartition instances, all sharing a common StorageDescriptor whose 
 location points to the root directory. 
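 The condensed form described above can be pictured with a minimal Java 
 sketch. All class and field names here ({{SharedSd}}, {{LighterPartition}}, 
 {{CondensedPartitionList}}) are hypothetical stand-ins, not the metastore's 
 real classes, and the sketch assumes that each sub-partition's directory name 
 equals its last partition value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for StorageDescriptor: only a few shared fields are modelled.
final class SharedSd {
    final String rootLocation;
    final String serdeClass;

    SharedSd(String rootLocation, String serdeClass) {
        this.rootLocation = rootLocation;
        this.serdeClass = serdeClass;
    }
}

// Hypothetical LighterPartition: partition values plus a path relative to the shared root.
final class LighterPartition {
    final List<String> values;
    final String relativePath;

    LighterPartition(List<String> values, String relativePath) {
        this.values = values;
        this.relativePath = relativePath;
    }
}

// One shared descriptor for the whole date, many lightweight partitions.
final class CondensedPartitionList {
    final SharedSd sharedSd;
    final List<LighterPartition> partitions = new ArrayList<>();

    CondensedPartitionList(SharedSd sharedSd) {
        this.sharedSd = sharedSd;
    }

    // Assumption: the sub-directory name equals the last partition value.
    void add(String... values) {
        partitions.add(new LighterPartition(Arrays.asList(values), values[values.length - 1]));
    }

    // The full location is derived on demand instead of being stored per partition.
    String locationOf(LighterPartition p) {
        return sharedSd.rootLocation + "/" + p.relativePath;
    }
}
```

 In this sketch the SerDe and root location are stored once per date, so the 
 per-partition payload shrinks to the partition values and a relative path.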
 We can go one better for the {{add_partitions()}} case: When adding all 
 partitions for a given date, the “normal” case affords us the ability to 
 specify the top-level date-directory, where sub-partitions can be inferred 
 from the HDFS directory-path.
 These extensions are hard to introduce at the metastore-level, since 
 partition-functions explicitly specify {{List<Partition>}} arguments. I 
 wonder if a {{PartitionSpec}} interface might help:
 {code}
 public PartitionSpec listPartitions(db_name, tbl_name, max_parts) throws ... ;
 public int add_partitions( PartitionSpec new_parts ) throws ... ;
 {code}
 where the PartitionSpec looks like:
 {code}
 public interface PartitionSpec {
   public List<Partition> getPartitions();
   public List<String> getPartNames();
   public Iterator<Partition> getPartitionIter();
   public Iterator<String> getPartNameIter();
 }
 {code}
 For {{add_partitions()}}, an {{HDFSDirBasedPartitionSpec}} class could implement 
 {{PartitionSpec}}, store a top-level directory, and return Partition 
 instances from sub-directory names, while storing a single StorageDescriptor 
 for all of them.
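 A directory-based spec along these lines might look like the sketch below. 
 It is only an illustration: {{SketchPartition}}, {{SketchPartitionSpec}} and 
 {{DirBasedPartitionSpecSketch}} are hypothetical names, the interface is 
 written with the generic types assumed from the signatures above, and the 
 sub-directory names are passed in as a list rather than read from HDFS.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical, trimmed-down stand-in for the metastore Partition class.
final class SketchPartition {
    final String name;
    final String location;

    SketchPartition(String name, String location) {
        this.name = name;
        this.location = location;
    }
}

// The proposed interface, with generic type parameters written out (an assumption).
interface SketchPartitionSpec {
    List<SketchPartition> getPartitions();
    List<String> getPartNames();
    Iterator<SketchPartition> getPartitionIter();
    Iterator<String> getPartNameIter();
}

// Stores one root directory; partitions are inferred from sub-directory names.
final class DirBasedPartitionSpecSketch implements SketchPartitionSpec {
    private final String rootDir;
    private final List<String> subDirNames; // assumed to come from a directory listing

    DirBasedPartitionSpecSketch(String rootDir, List<String> subDirNames) {
        // Normalise a trailing slash so locations are joined consistently.
        this.rootDir = rootDir.endsWith("/") ? rootDir.substring(0, rootDir.length() - 1) : rootDir;
        this.subDirNames = new ArrayList<>(subDirNames);
    }

    @Override
    public Iterator<SketchPartition> getPartitionIter() {
        final Iterator<String> names = subDirNames.iterator();
        return new Iterator<SketchPartition>() {
            @Override public boolean hasNext() { return names.hasNext(); }
            @Override public SketchPartition next() {
                String name = names.next();
                // Each partition object is built only when the caller asks for it.
                return new SketchPartition(name, rootDir + "/" + name);
            }
        };
    }

    @Override
    public Iterator<String> getPartNameIter() {
        return Collections.unmodifiableList(subDirNames).iterator();
    }

    @Override
    public List<SketchPartition> getPartitions() {
        List<SketchPartition> all = new ArrayList<>();
        for (Iterator<SketchPartition> it = getPartitionIter(); it.hasNext();) {
            all.add(it.next());
        }
        return all;
    }

    @Override
    public List<String> getPartNames() {
        return new ArrayList<>(subDirNames);
    }
}
```

 A caller that only needs to stream partitions would use 
 {{getPartitionIter()}}; {{getPartitions()}} materializes the full list and is 
 kept for callers that still expect one.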
 Similarly, list_partitions() could return a List<PartitionSpec>, where each 
 PartitionSpec corresponds to a set of partitions that can share a 
 StorageDescriptor.
 By exposing iterator semantics, neither the client nor the metastore need 
 instantiate all partitions at once. That should help with memory requirements.
 In case no smart grouping is possible, we could just fall back on a 
 {{DefaultPartitionSpec}} which composes {{List<Partition>}}, and is no worse 
 than the status quo.
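 The fallback could be as thin as a wrapper around the existing list. The 
 sketch below is a guess at the shape, not the patch's code: 
 {{SpecSketch}} and {{DefaultPartitionSpecSketch}} are hypothetical names, and 
 the partition type is left as a type parameter {{P}} with a caller-supplied 
 naming function so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// The proposed interface, generalised over the partition type P for this sketch.
interface SpecSketch<P> {
    List<P> getPartitions();
    List<String> getPartNames();
    Iterator<P> getPartitionIter();
    Iterator<String> getPartNameIter();
}

// Fallback spec: simply composes the full partition list, as today.
final class DefaultPartitionSpecSketch<P> implements SpecSketch<P> {
    private final List<P> partitions;
    private final Function<P, String> nameOf; // hypothetical: how a partition is named

    DefaultPartitionSpecSketch(List<P> partitions, Function<P, String> nameOf) {
        this.partitions = Collections.unmodifiableList(new ArrayList<>(partitions));
        this.nameOf = nameOf;
    }

    @Override
    public List<P> getPartitions() {
        return partitions;
    }

    @Override
    public Iterator<P> getPartitionIter() {
        return partitions.iterator();
    }

    @Override
    public List<String> getPartNames() {
        List<String> names = new ArrayList<>(partitions.size());
        for (P p : partitions) {
            names.add(nameOf.apply(p));
        }
        return names;
    }

    @Override
    public Iterator<String> getPartNameIter() {
        final Iterator<P> it = partitions.iterator();
        return new Iterator<String>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public String next() { return nameOf.apply(it.next()); }
        };
    }
}
```

 Because nothing is condensed here, memory and wire cost match a plain list of 
 partitions, which is the "no worse" baseline the paragraph refers to.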
 PartitionSpec abstracts away how a set of partitions may be represented. A 
 tighter representation allows us to communicate metadata for a larger number 
 of Partitions, with less Thrift traffic.
 Given that Thrift doesn’t support polymorphism, we’d have to implement the 
 PartitionSpec as a Thrift Union of supported implementations. (We could 
 convert from the Thrift PartitionSpec to the appropriate Java PartitionSpec 
 sub-class.)
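 The conversion step could be sketched roughly as follows. This is a 
 hand-written emulation of a union, not Thrift-generated code: 
 {{WirePartitionSpec}}, {{LocationSpec}} and the other names are hypothetical, 
 and partitions are reduced to their location strings to keep the example 
 short.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-in for a union: exactly one alternative is populated.
final class WirePartitionSpec {
    enum Kind { PARTITION_LIST, SHARED_SD_DIR }

    final Kind kind;
    final List<String> partitionLocations; // used when kind == PARTITION_LIST
    final String rootDir;                  // used when kind == SHARED_SD_DIR
    final List<String> subDirNames;        // used when kind == SHARED_SD_DIR

    private WirePartitionSpec(Kind kind, List<String> locs, String rootDir, List<String> subDirs) {
        this.kind = kind;
        this.partitionLocations = locs;
        this.rootDir = rootDir;
        this.subDirNames = subDirs;
    }

    static WirePartitionSpec ofList(List<String> locations) {
        return new WirePartitionSpec(Kind.PARTITION_LIST, locations, null, null);
    }

    static WirePartitionSpec ofDir(String rootDir, List<String> subDirNames) {
        return new WirePartitionSpec(Kind.SHARED_SD_DIR, null, rootDir, subDirNames);
    }
}

// Java-side abstraction that callers program against.
interface LocationSpec {
    List<String> locations();
}

final class ListLocationSpec implements LocationSpec {
    private final List<String> locations;

    ListLocationSpec(List<String> locations) {
        this.locations = new ArrayList<>(locations);
    }

    @Override
    public List<String> locations() {
        return new ArrayList<>(locations);
    }
}

final class DirLocationSpec implements LocationSpec {
    private final String rootDir;
    private final List<String> subDirNames;

    DirLocationSpec(String rootDir, List<String> subDirNames) {
        this.rootDir = rootDir;
        this.subDirNames = new ArrayList<>(subDirNames);
    }

    @Override
    public List<String> locations() {
        List<String> out = new ArrayList<>();
        for (String name : subDirNames) {
            out.add(rootDir + "/" + name);
        }
        return out;
    }
}

final class WireConverter {
    // Picks the Java sub-class that matches whichever union alternative is set.
    static LocationSpec fromWire(WirePartitionSpec wire) {
        switch (wire.kind) {
            case PARTITION_LIST:
                return new ListLocationSpec(wire.partitionLocations);
            case SHARED_SD_DIR:
                return new DirLocationSpec(wire.rootDir, wire.subDirNames);
            default:
                throw new IllegalArgumentException("Unknown kind: " + wire.kind);
        }
    }
}
```

 Adding a new representation then means adding one union alternative and one 
 branch in the converter, while existing callers keep using the interface.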
 Thoughts?





[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Status: Open  (was: Patch Available)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.13.0, 0.12.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Attachment: HIVE-7223.3.patch

Still struggling to get this on reviews.apache.org.

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch




[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Attachment: (was: HIVE-7223.3.patch)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch




[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Attachment: HIVE-7223.3.patch

Here's the updated patch. I've removed some dead code and corrected the log 
message, as per Alan's advice.

Also, https://reviews.apache.org/r/24872/.

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch




[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Status: Open  (was: Patch Available)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.13.0, 0.12.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch




[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Attachment: (was: HIVE-7223.2.patch)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch




[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Status: Patch Available  (was: Open)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.13.0, 0.12.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch


 Currently, the functions in the HiveMetaStore API that handle multiple 
 partitions do so using ListPartition. E.g. 
 {code}
 public ListPartition listPartitions(String db_name, String tbl_name, short 
 max_parts);
 public ListPartition listPartitionsByFilter(String db_name, String 
 tbl_name, String filter, short max_parts);
 public int add_partitions(ListPartition new_parts);
 {code}
 Partition objects are fairly heavyweight, since each Partition carries its 
 own copy of a StorageDescriptor, partition-values, etc. Tables with tens of 
 thousands of partitions take so long to have their partitions listed that the 
 client times out with default hive.metastore.client.socket.timeout. There is 
 the additional expense of serializing and deserializing metadata for large 
 sets of partitions, w.r.t time and heap-space. Reducing the thrift traffic 
 should help in this regard.
 In a date-partitioned table, all sub-partitions for a particular date are 
 *likely* (but not expected) to have:
 # The same base directory (e.g. {{/feeds/search/20140601/}})
 # Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}})
 # The same SerDe/StorageHandler/IOFormat classes
 # Sorting/Bucketing/SkewInfo settings
 In this “most likely” scenario (henceforth termed “normal”), it’s possible to 
 represent the partition-list (for a date) in a more condensed form: a list of 
 LighterPartition instances, all sharing a common StorageDescriptor whose 
 location points to the root directory. 
 We can go one better for the {{add_partitions()}} case: When adding all 
 partitions for a given date, the “normal” case affords us the ability to 
 specify the top-level date-directory, where sub-partitions can be inferred 
 from the HDFS directory-path.
 These extensions are hard to introduce at the metastore-level, since 
 partition-functions explicitly specify {{List<Partition>}} arguments. I 
 wonder if a {{PartitionSpec}} interface might help:
 {code}
 public PartitionSpec listPartitions(db_name, tbl_name, max_parts) throws ... ;
 public int add_partitions( PartitionSpec new_parts ) throws … ;
 {code}
 where the PartitionSpec looks like:
 {code}
 public interface PartitionSpec {
     public List<Partition> getPartitions();
     public List<String> getPartNames();
     public Iterator<Partition> getPartitionIter();
     public Iterator<String> getPartNameIter();
 }
 {code}
 For addPartitions(), an {{HDFSDirBasedPartitionSpec}} class could implement 
 {{PartitionSpec}}, store a top-level directory, and return Partition 
 instances from sub-directory names, while storing a single StorageDescriptor 
 for all of them.
 Similarly, list_partitions() could return a List<PartitionSpec>, where each 
 PartitionSpec corresponds to a set of partitions that can share a 
 StorageDescriptor.
 By exposing iterator semantics, neither the client nor the metastore need 
 instantiate all partitions at once. That should help with memory requirements.
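
 A minimal sketch of the iterator idea, under stated assumptions: {{SketchPartition}}, {{IterablePartitionSpec}} and {{DirBasedSpecSketch}} are hypothetical names, the interface is cut down to its two iterator methods, and the sub-directory names are passed in as a list rather than read from HDFS. Each partition object is built only when the iterator reaches it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical minimal partition: just a name and a location. */
final class SketchPartition {
    final String name;
    final String location;

    SketchPartition(String name, String location) {
        this.name = name;
        this.location = location;
    }
}

/** Cut-down version of the proposed interface: iterator methods only. */
interface IterablePartitionSpec {
    Iterator<SketchPartition> getPartitionIter();
    Iterator<String> getPartNameIter();
}

/** Stores one root directory and creates partitions lazily from sub-directory names. */
final class DirBasedSpecSketch implements IterablePartitionSpec {
    private final String rootDir;
    private final List<String> subDirNames;
    private int created = 0; // counts partitions actually instantiated

    DirBasedSpecSketch(String rootDir, List<String> subDirNames) {
        this.rootDir = rootDir;
        // Defensive, read-only copy so callers cannot change the names later.
        this.subDirNames = Collections.unmodifiableList(new ArrayList<String>(subDirNames));
    }

    @Override
    public Iterator<String> getPartNameIter() {
        return subDirNames.iterator();
    }

    @Override
    public Iterator<SketchPartition> getPartitionIter() {
        final Iterator<String> names = subDirNames.iterator();
        return new Iterator<SketchPartition>() {
            @Override
            public boolean hasNext() {
                return names.hasNext();
            }

            @Override
            public SketchPartition next() {
                String name = names.next();
                created++; // a partition object exists only from this point on
                return new SketchPartition(name, rootDir + "/" + name);
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    int partitionsCreated() {
        return created;
    }
}
```

 A caller that stops after the first few partitions never pays for the rest, which is the memory saving the paragraph above is after.
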
 In case no smart grouping is possible, we could just fall back on a 
 {{DefaultPartitionSpec}} which composes {{List<Partition>}}, and is no worse 
 than the status quo.
 PartitionSpec abstracts away how a set of partitions may be represented. A 
 tighter representation allows us to communicate metadata for a larger number 
 of Partitions, with less Thrift traffic.
 Given that Thrift doesn’t support polymorphism, we’d have to implement the 
 PartitionSpec as a Thrift Union of supported implementations. (We could 
 convert from the Thrift PartitionSpec to the appropriate Java PartitionSpec 
 sub-class.)
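
 The union-to-subclass conversion could be sketched as below. This is plain Java that only imitates a union (exactly one field set); it does not use Thrift or its generated classes, and every name in it ({{WirePartitionSpec}}, {{WireSharedRoot}}, {{JavaPartitionSpec}}, {{ListBackedSpec}}, {{SharedRootSpec}}, {{SpecConverter}}) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Payload of the "shared root" variant. */
final class WireSharedRoot {
    final String root;
    final List<String> subDirs;

    WireSharedRoot(String root, List<String> subDirs) {
        this.root = root;
        this.subDirs = subDirs;
    }
}

/** Hand-written imitation of a union: exactly one of the two fields is non-null. */
final class WirePartitionSpec {
    final List<String> explicitLocations; // variant 1: every location spelled out
    final WireSharedRoot sharedRoot;      // variant 2: one root plus sub-directories

    private WirePartitionSpec(List<String> explicitLocations, WireSharedRoot sharedRoot) {
        this.explicitLocations = explicitLocations;
        this.sharedRoot = sharedRoot;
    }

    static WirePartitionSpec ofLocations(List<String> locations) {
        return new WirePartitionSpec(locations, null);
    }

    static WirePartitionSpec ofSharedRoot(String root, List<String> subDirs) {
        return new WirePartitionSpec(null, new WireSharedRoot(root, subDirs));
    }
}

/** Java-side abstraction that callers program against. */
interface JavaPartitionSpec {
    List<String> locations();
}

/** Fallback representation: no grouping, locations stored as given. */
final class ListBackedSpec implements JavaPartitionSpec {
    private final List<String> locations;

    ListBackedSpec(List<String> locations) {
        this.locations = locations;
    }

    @Override
    public List<String> locations() {
        return locations;
    }
}

/** Condensed representation: locations are derived from the shared root. */
final class SharedRootSpec implements JavaPartitionSpec {
    private final String root;
    private final List<String> subDirs;

    SharedRootSpec(String root, List<String> subDirs) {
        this.root = root;
        this.subDirs = subDirs;
    }

    @Override
    public List<String> locations() {
        List<String> result = new ArrayList<String>();
        for (String dir : subDirs) {
            result.add(root + "/" + dir);
        }
        return result;
    }
}

/** Picks the Java sub-class that matches whichever union field is set. */
final class SpecConverter {
    static JavaPartitionSpec fromWire(WirePartitionSpec wire) {
        if (wire.sharedRoot != null) {
            return new SharedRootSpec(wire.sharedRoot.root, wire.sharedRoot.subDirs);
        }
        return new ListBackedSpec(wire.explicitLocations);
    }
}
```

 Callers would see only {{JavaPartitionSpec}}, so adding another representation would mean one more union field and one more branch in the converter.
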
 Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-08 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Attachment: HIVE-7223.2.patch

Updated patch, with Thrift definitions updated, etc.

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch







[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-07 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Status: Open  (was: Patch Available)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.13.0, 0.12.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch







[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-07 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Status: Patch Available  (was: Open)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.13.0, 0.12.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch







[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-07 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Attachment: HIVE-7223.2.patch

Sorry, I didn't realize the generated code needed to be included. Here's the 
complete patch.
(This includes changes to *cpp, *php, etc., though.)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch







[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-06 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Attachment: HIVE-7223.1.patch

Here's the initial patch, with tests.

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.12.0, 0.13.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch







[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-08-06 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Status: Patch Available  (was: Open)

 Support generic PartitionSpecs in Metastore partition-functions
 ---

 Key: HIVE-7223
 URL: https://issues.apache.org/jira/browse/HIVE-7223
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog, Metastore
Affects Versions: 0.13.0, 0.12.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-7223.1.patch







[jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

2014-06-11 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-7223:
---

Description: 
Currently, the functions in the HiveMetaStore API that handle multiple 
partitions do so using List<Partition>. E.g. 
{code}
public List<Partition> listPartitions(String db_name, String tbl_name, short max_parts);
public List<Partition> listPartitionsByFilter(String db_name, String tbl_name, String filter, short max_parts);
public int add_partitions(List<Partition> new_parts);
{code}

Partition objects are fairly heavyweight, since each Partition carries its own 
copy of a StorageDescriptor, partition-values, etc. Tables with tens of 
thousands of partitions take so long to have their partitions listed that the 
client times out with default hive.metastore.client.socket.timeout. There is 
the additional expense of serializing and deserializing metadata for large sets 
of partitions, w.r.t time and heap-space. Reducing the thrift traffic should 
help in this regard.

In a date-partitioned table, all sub-partitions for a particular date are 
*likely* (but not expected) to have:

# The same base directory (e.g. {{/feeds/search/20140601/}})
# Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}})
# The same SerDe/StorageHandler/IOFormat classes
# Sorting/Bucketing/SkewInfo settings

In this “most likely” scenario (henceforth termed “normal”), it’s possible to 
represent the partition-list (for a date) in a more condensed form: a list of 
LighterPartition instances, all sharing a common StorageDescriptor whose 
location points to the root directory. 

We can go one better for the {{add_partitions()}} case: When adding all 
partitions for a given date, the “normal” case affords us the ability to 
specify the top-level date-directory, where sub-partitions can be inferred from 
the HDFS directory-path.

These extensions are hard to introduce at the metastore-level, since 
partition-functions explicitly specify {{List<Partition>}} arguments. I wonder 
if a {{PartitionSpec}} interface might help:

{code}
public PartitionSpec listPartitions(db_name, tbl_name, max_parts) throws ... ; 
public int add_partitions( PartitionSpec new_parts ) throws … ;
{code}

where the PartitionSpec looks like:

{code}
public interface PartitionSpec {
    public List<Partition> getPartitions();
    public List<String> getPartNames();
    public Iterator<Partition> getPartitionIter();
    public Iterator<String> getPartNameIter();
}
{code}

For addPartitions(), an {{HDFSDirBasedPartitionSpec}} class could implement 
{{PartitionSpec}}, store a top-level directory, and return Partition instances 
from sub-directory names, while storing a single StorageDescriptor for all of 
them.

Similarly, list_partitions() could return a List<PartitionSpec>, where each 
PartitionSpec corresponds to a set of partitions that can share a 
StorageDescriptor.

By exposing iterator semantics, neither the client nor the metastore need 
instantiate all partitions at once. That should help with memory requirements.

In case no smart grouping is possible, we could just fall back on a 
{{DefaultPartitionSpec}} which composes {{List<Partition>}}, and is no worse 
than the status quo.

PartitionSpec abstracts away how a set of partitions may be represented. A 
tighter representation allows us to communicate metadata for a larger number of 
Partitions, with less Thrift traffic.

Given that Thrift doesn’t support polymorphism, we’d have to implement the 
PartitionSpec as a Thrift Union of supported implementations. (We could convert 
from the Thrift PartitionSpec to the appropriate Java PartitionSpec sub-class.)

Thoughts?


  was:
Currently, the functions in the HiveMetaStore API that handle multiple 
partitions do so using List<Partition>. E.g. 
{code}
public List<Partition> listPartitions(String db_name, String tbl_name, short max_parts);
public List<Partition> listPartitionsByFilter(String db_name, String tbl_name, String filter, short max_parts);
public int add_partitions(List<Partition> new_parts);
{code}

Partition objects are fairly heavyweight, since each Partition carries its own 
copy of a StorageDescriptor, partition-values, etc. Tables with tens of 
thousands of partitions take so long to have their partitions listed that the 
client times out with default hive.metastore.client.socket.timeout. There is 
the additional expense of serializing and deserializing metadata for large sets 
of partitions, w.r.t time and heap-space. Reducing the thrift traffic should 
help in this regard.

In a date-partitioned table, all sub-partitions for a particular date are 
*likely* (but not expected) to have:

# The same base directory (e.g. {{/feeds/search/20140601/}})
# Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}})
# The same SerDe/StorageHandler/IOFormat classes
# Sorting/Bucketing/SkewInfo settings

In this “most