[jira] [Commented] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

Mithun Radhakrishnan (JIRA) Tue, 19 Aug 2014 14:03:36 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102834#comment-14102834
 ]


Mithun Radhakrishnan commented on HIVE-7223:
--------------------------------------------

About the problems you raised:

{quote}initializeAddedPartition, you are duplicating a lot of functionality 
here. Isn't there a way to refactor this so that it shares code with the 
existing initializeAddPartition? Duplication of functionality is a grievous 
evil that leads to many bugs. Same comment applies to your changes to 
MetaStoreUtils.updatePartitionStatsFast, and ObjectStore.addPartitions.{quote}

I'm afraid there's no way around that. I hoped to use a function template to 
use a single function, but alas... Java. The only recourse is to pass in an 
{{Object}}, and downcast it. I'm not keen on that, but I'll go along if that's 
what you advise.

{quote}get_partitions_pspec, why is there a call in here to get_partitions? 
That's going to fetch all of the partitions as before, correct? I thought part 
of the goal of this...{quote}

Yes, one of our goals was not to have the full List<Partition> instantiated. 
(We also wanted to reduce redundant wire-traffic, etc.) There might not be a 
way to avoid this construction in the {{get_partitions_pspec()}} path without 
invasive changes to MPartition, etc. (i.e. the db layer), unless I've missed 
something. On the bright side, the full list is constructed only in the 
{{get_partitions_pspec*()}} path. {{add_partitions_pspec()}} honours the 
iterator semantics. How might I otherwise implement {{get_partitions_pspec()}}?

{quote} PartValEqWrapperLite.equals, you're not checking that location has the 
same value.{quote}

That's actually by design. We're using the PartValEqWrapperLite (just as the 
PartValEqWrapper class before it) to identify repetition of Partitions in the 
list specified by the caller. 2 Partitions with the same partition-key sets are 
considered identical even if their locations differ.

{quote} PartValEqWrapperLite.add_partitions_pspec_core, the entry log message 
is wrong (looks like a copy/paste error), and probably belongs in 
add_partitions_pspec.{quote}

Clumsy. Fixed. (I'll up the new patch shortly.)

{quote} PartValEqWrapperLite.add_partitions_pspec_core, remove dead commented 
out code (line 2047) {quote}

Removed.

> Support generic PartitionSpecs in Metastore partition-functions
> ---------------------------------------------------------------
>
>                 Key: HIVE-7223
>                 URL: https://issues.apache.org/jira/browse/HIVE-7223
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog, Metastore
>    Affects Versions: 0.12.0, 0.13.0
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>         Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch
>
>
> Currently, the functions in the HiveMetaStore API that handle multiple 
> partitions do so using List<Partition>. E.g. 
> {code}
> public List<Partition> listPartitions(String db_name, String tbl_name, short 
> max_parts);
> public List<Partition> listPartitionsByFilter(String db_name, String 
> tbl_name, String filter, short max_parts);
> public int add_partitions(List<Partition> new_parts);
> {code}
> Partition objects are fairly heavyweight, since each Partition carries its 
> own copy of a StorageDescriptor, partition-values, etc. Tables with tens of 
> thousands of partitions take so long to have their partitions listed that the 
> client times out with default hive.metastore.client.socket.timeout. There is 
> the additional expense of serializing and deserializing metadata for large 
> sets of partitions, w.r.t time and heap-space. Reducing the thrift traffic 
> should help in this regard.
> In a date-partitioned table, all sub-partitions for a particular date are 
> *likely* (but not expected) to have:
> # The same base directory (e.g. {{/feeds/search/20140601/}})
> # Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}})
> # The same SerDe/StorageHandler/IOFormat classes
> # Sorting/Bucketing/SkewInfo settings
> In this “most likely” scenario (henceforth termed “normal”), it’s possible to 
> represent the partition-list (for a date) in a more condensed form: a list of 
> LighterPartition instances, all sharing a common StorageDescriptor whose 
> location points to the root directory. 
> We can go one better for the {{add_partitions()}} case: When adding all 
> partitions for a given date, the “normal” case affords us the ability to 
> specify the top-level date-directory, where sub-partitions can be inferred 
> from the HDFS directory-path.
> These extensions are hard to introduce at the metastore-level, since 
> partition-functions explicitly specify {{List<Partition>}} arguments. I 
> wonder if a {{PartitionSpec}} interface might help:
> {code}
> public PartitionSpec listPartitions(db_name, tbl_name, max_parts) throws ... 
> ; 
> public int add_partitions( PartitionSpec new_parts ) throws … ;
> {code}
> where the PartitionSpec looks like:
> {code}
> public interface PartitionSpec {
>         public List<Partition> getPartitions();
>         public List<String> getPartNames();
>         public Iterator<Partition> getPartitionIter();
>         public Iterator<String> getPartNameIter();
> }
> {code}
> For addPartitions(), an {{HDFSDirBasedPartitionSpec}} class could implement 
> {{PartitionSpec}}, store a top-level directory, and return Partition 
> instances from sub-directory names, while storing a single StorageDescriptor 
> for all of them.
> Similarly, list_partitions() could return a List<PartitionSpec>, where each 
> PartitionSpec corresponds to a set or partitions that can share a 
> StorageDescriptor.
> By exposing iterator semantics, neither the client nor the metastore need 
> instantiate all partitions at once. That should help with memory requirements.
> In case no smart grouping is possible, we could just fall back on a 
> {{DefaultPartitionSpec}} which composes {{List<Partition>}}, and is no worse 
> than status quo.
> PartitionSpec abstracts away how a set of partitions may be represented. A 
> tighter representation allows us to communicate metadata for a larger number 
> of Partitions, with less Thrift traffic.
> Given that Thrift doesn’t support polymorphism, we’d have to implement the 
> PartitionSpec as a Thrift Union of supported implementations. (We could 
> convert from the Thrift PartitionSpec to the appropriate Java PartitionSpec 
> sub-class.)
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

Reply via email to