[ https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mithun Radhakrishnan updated HIVE-7223:
---------------------------------------

    Status: Open  (was: Patch Available)

> Support generic PartitionSpecs in Metastore partition-functions
> ---------------------------------------------------------------
>
>                 Key: HIVE-7223
>                 URL: https://issues.apache.org/jira/browse/HIVE-7223
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog, Metastore
>    Affects Versions: 0.13.0, 0.12.0
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>         Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch
>
>
> Currently, the functions in the HiveMetaStore API that handle multiple 
> partitions do so using List<Partition>. E.g. 
> {code}
> public List<Partition> listPartitions(String db_name, String tbl_name, short max_parts);
> public List<Partition> listPartitionsByFilter(String db_name, String tbl_name, String filter, short max_parts);
> public int add_partitions(List<Partition> new_parts);
> {code}
> Partition objects are fairly heavyweight, since each Partition carries its 
> own copy of a StorageDescriptor, partition-values, etc. For tables with tens 
> of thousands of partitions, listing them takes so long that the client times 
> out under the default {{hive.metastore.client.socket.timeout}}. Serializing 
> and deserializing the metadata for large sets of partitions also costs time 
> and heap space. Reducing the Thrift traffic should help on both counts.
> In a date-partitioned table, all sub-partitions for a particular date are 
> *likely* (but not guaranteed) to have:
> # The same base directory (e.g. {{/feeds/search/20140601/}})
> # Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}})
> # The same SerDe/StorageHandler/IOFormat classes
> # The same Sorting/Bucketing/SkewInfo settings
> In this “most likely” scenario (henceforth termed “normal”), it’s possible to 
> represent the partition-list (for a date) in a more condensed form: a list of 
> LighterPartition instances, all sharing a common StorageDescriptor whose 
> location points to the root directory. 
> We can go one better for the {{add_partitions()}} case: When adding all 
> partitions for a given date, the “normal” case affords us the ability to 
> specify the top-level date-directory, where sub-partitions can be inferred 
> from the HDFS directory-path.
> These extensions are hard to introduce at the metastore-level, since 
> partition-functions explicitly specify {{List<Partition>}} arguments. I 
> wonder if a {{PartitionSpec}} interface might help:
> {code}
> public PartitionSpec listPartitions(String db_name, String tbl_name, short max_parts) throws ...;
> public int add_partitions(PartitionSpec new_parts) throws ...;
> {code}
> where the PartitionSpec looks like:
> {code}
> public interface PartitionSpec {
>         public List<Partition> getPartitions();
>         public List<String> getPartNames();
>         public Iterator<Partition> getPartitionIter();
>         public Iterator<String> getPartNameIter();
> }
> {code}
> For {{add_partitions()}}, an {{HDFSDirBasedPartitionSpec}} class could 
> implement {{PartitionSpec}}, store a top-level directory, and build Partition 
> instances from sub-directory names, while storing a single StorageDescriptor 
> for all of them.
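
A minimal sketch of what such an implementation might look like, under stated assumptions: {{StorageDescriptor}} and {{Partition}} below are small hypothetical stand-ins (the real Thrift-generated classes carry many more fields), the sub-directory names are passed in by the caller rather than listed from HDFS, and partition names are simply the sub-directory names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the metastore's StorageDescriptor.
final class StorageDescriptor {
    final String location;    // root directory shared by all partitions
    final String serdeClass;  // illustrative field only
    StorageDescriptor(String location, String serdeClass) {
        this.location = location;
        this.serdeClass = serdeClass;
    }
}

// Hypothetical stand-in for the metastore's Partition.
final class Partition {
    final String name;                 // e.g. "US"
    final String location;             // full directory of this partition
    final StorageDescriptor sharedSd;  // reference to the shared descriptor, not a copy
    Partition(String name, String location, StorageDescriptor sharedSd) {
        this.name = name;
        this.location = location;
        this.sharedSd = sharedSd;
    }
}

// The interface as proposed in the description.
interface PartitionSpec {
    List<Partition> getPartitions();
    List<String> getPartNames();
    Iterator<Partition> getPartitionIter();
    Iterator<String> getPartNameIter();
}

// Assumption: the caller has already listed the sub-directories of the root.
final class HDFSDirBasedPartitionSpec implements PartitionSpec {
    private final StorageDescriptor rootSd;
    private final List<String> subDirNames;

    HDFSDirBasedPartitionSpec(StorageDescriptor rootSd, List<String> subDirNames) {
        this.rootSd = rootSd;
        this.subDirNames = Collections.unmodifiableList(new ArrayList<>(subDirNames));
    }

    // Builds each Partition only when the iterator reaches it.
    @Override
    public Iterator<Partition> getPartitionIter() {
        final Iterator<String> names = subDirNames.iterator();
        final String base = rootSd.location.endsWith("/")
                ? rootSd.location : rootSd.location + "/";
        return new Iterator<Partition>() {
            @Override public boolean hasNext() { return names.hasNext(); }
            @Override public Partition next() {
                String name = names.next();
                return new Partition(name, base + name, rootSd);
            }
        };
    }

    @Override
    public Iterator<String> getPartNameIter() {
        return subDirNames.iterator();
    }

    // Materializes every Partition; callers that can stream should prefer the iterator.
    @Override
    public List<Partition> getPartitions() {
        List<Partition> all = new ArrayList<>();
        getPartitionIter().forEachRemaining(all::add);
        return all;
    }

    @Override
    public List<String> getPartNames() {
        return subDirNames;
    }
}
```

In this sketch, every Partition returned holds a reference to the one shared StorageDescriptor, so only the root descriptor and the list of names would need to travel over the wire.
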
> Similarly, {{listPartitions()}} could return a List<PartitionSpec>, where each 
> PartitionSpec corresponds to a set of partitions that can share a 
> StorageDescriptor.
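
One way the grouping step might work, as a rough sketch only: bucket partition locations by their parent directory, so that each bucket could back one PartitionSpec. The class and method names are hypothetical, and the sketch assumes that partitions under the same parent also agree on the remaining StorageDescriptor fields (SerDe, bucketing, and so on); a real implementation would have to compare those as well.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Hive metastore API.
final class PartitionGrouping {
    private PartitionGrouping() {}

    // Maps each parent directory to the sub-directory names found under it,
    // preserving the order in which parents first appear.
    static Map<String, List<String>> groupByParentDir(List<String> locations) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String location : locations) {
            // Drop a single trailing slash so "/a/b/" and "/a/b" group together.
            String trimmed = location.endsWith("/")
                    ? location.substring(0, location.length() - 1)
                    : location;
            int slash = trimmed.lastIndexOf('/');
            String parent = slash <= 0 ? "/" : trimmed.substring(0, slash);
            String child = trimmed.substring(slash + 1);
            groups.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
        }
        return groups;
    }
}
```

Each map entry then corresponds to one candidate group: the key plays the role of the shared root location and the value lists the sub-partitions beneath it.
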
> By exposing iterator semantics, neither the client nor the metastore needs to 
> instantiate all partitions at once, which should reduce memory requirements.
> In case no smart grouping is possible, we could just fall back on a 
> {{DefaultPartitionSpec}} which composes {{List<Partition>}}, and is no worse 
> than status quo.
> PartitionSpec abstracts away how a set of partitions may be represented. A 
> tighter representation allows us to communicate metadata for a larger number 
> of Partitions, with less Thrift traffic.
> Given that Thrift doesn’t support polymorphism, we’d have to implement the 
> PartitionSpec as a Thrift Union of supported implementations. (We could 
> convert from the Thrift PartitionSpec to the appropriate Java PartitionSpec 
> sub-class.)
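
To illustrate the union idea, here is a hand-written sketch of a tagged union with two alternatives and a conversion step that expands either one into full partition locations. Thrift would generate its own union class from the IDL; the class, enum, and method names below are hypothetical and only mimic the "exactly one alternative is set" shape.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a Thrift-generated union of PartitionSpec forms.
final class PartitionSpecUnion {
    enum Kind { LOCATION_LIST, SHARED_ROOT }

    final Kind kind;
    final List<String> locations;    // set only for LOCATION_LIST
    final String rootDir;            // set only for SHARED_ROOT
    final List<String> subDirNames;  // set only for SHARED_ROOT

    private PartitionSpecUnion(Kind kind, List<String> locations,
                               String rootDir, List<String> subDirNames) {
        this.kind = kind;
        this.locations = locations;
        this.rootDir = rootDir;
        this.subDirNames = subDirNames;
    }

    // Fallback form: every partition location is spelled out.
    static PartitionSpecUnion ofLocations(List<String> locations) {
        return new PartitionSpecUnion(Kind.LOCATION_LIST,
                new ArrayList<>(locations), null, null);
    }

    // Condensed form: one root directory plus the names beneath it.
    static PartitionSpecUnion ofSharedRoot(String rootDir, List<String> subDirNames) {
        return new PartitionSpecUnion(Kind.SHARED_ROOT,
                null, rootDir, new ArrayList<>(subDirNames));
    }

    // Dispatches on the tag and expands either form to full locations.
    List<String> toLocations() {
        switch (kind) {
            case LOCATION_LIST:
                return Collections.unmodifiableList(locations);
            case SHARED_ROOT: {
                String base = rootDir.endsWith("/") ? rootDir : rootDir + "/";
                List<String> out = new ArrayList<>();
                for (String name : subDirNames) {
                    out.add(base + name);
                }
                return out;
            }
            default:
                throw new IllegalStateException("unknown kind: " + kind);
        }
    }
}
```

The same switch on the tag is where a converter could instead construct the matching Java PartitionSpec sub-class.
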
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
