[jira] [Commented] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

Alan Gates (JIRA) Sat, 30 Aug 2014 13:26:07 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116550#comment-14116550
 ]


Alan Gates commented on HIVE-7223:
----------------------------------

{quote} About the problems you raised:
{color:red} initializeAddedPartition, you are duplicating a lot of 
functionality here. Isn't there a way to refactor this so that it shares code 
with the existing initializeAddPartition? Duplication of functionality is a 
grievous evil that leads to many bugs. Same comment applies to your changes to 
MetaStoreUtils.updatePartitionStatsFast, and ObjectStore.addPartitions.
{color}
I'm afraid there's no way around that. I hoped to use a function template to 
use a single function, but alas... Java. The only recourse is to pass in an 
Object, and downcast it. I'm not keen on that, but I'll go along if that's what 
you advise.
{quote}
Couldn't you change initializeAddedPartiion(Table, Partition, boolean) to wrap 
the Partition object in a simple instance of 
PartitionSpecProxy.PartitionIterater and then call 
initializeAddedPartition(Table, Partition, boolean)?

{quote}{color:red}
get_partitions_pspec, why is there a call in here to get_partitions? That's 
going to fetch all of the partitions as before, correct? I thought part of the 
goal of this...{color}
Yes, one of our goals was not to have the full List<Partition> instantiated. 
(We also wanted to reduce redundant wire-traffic, etc.) There might not be a 
way to avoid this construction in the get_partitions_pspec() path without 
invasive changes to MPartition, etc. (i.e. the db layer), unless I've missed 
something. On the bright side, the full list is constructed only in the 
get_partitions_pspec*() path. add_partitions_pspec() honours the iterator 
semantics. How might I otherwise implement get_partitions_pspec()?
{quote}
ok

{quote}{color:red}
PartValEqWrapperLite.equals, you're not checking that location has the same 
value.{color}
That's actually by design. We're using the PartValEqWrapperLite (just as the 
PartValEqWrapper class before it) to identify repetition of Partitions in the 
list specified by the caller. 2 Partitions with the same partition-key sets are 
considered identical even if their locations differ.
{quote}
ok

> Support generic PartitionSpecs in Metastore partition-functions
> ---------------------------------------------------------------
>
>                 Key: HIVE-7223
>                 URL: https://issues.apache.org/jira/browse/HIVE-7223
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog, Metastore
>    Affects Versions: 0.12.0, 0.13.0
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>         Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch
>
>
> Currently, the functions in the HiveMetaStore API that handle multiple 
> partitions do so using List<Partition>. E.g. 
> {code}
> public List<Partition> listPartitions(String db_name, String tbl_name, short 
> max_parts);
> public List<Partition> listPartitionsByFilter(String db_name, String 
> tbl_name, String filter, short max_parts);
> public int add_partitions(List<Partition> new_parts);
> {code}
> Partition objects are fairly heavyweight, since each Partition carries its 
> own copy of a StorageDescriptor, partition-values, etc. Tables with tens of 
> thousands of partitions take so long to have their partitions listed that the 
> client times out with default hive.metastore.client.socket.timeout. There is 
> the additional expense of serializing and deserializing metadata for large 
> sets of partitions, w.r.t time and heap-space. Reducing the thrift traffic 
> should help in this regard.
> In a date-partitioned table, all sub-partitions for a particular date are 
> *likely* (but not expected) to have:
> # The same base directory (e.g. {{/feeds/search/20140601/}})
> # Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}})
> # The same SerDe/StorageHandler/IOFormat classes
> # Sorting/Bucketing/SkewInfo settings
> In this “most likely” scenario (henceforth termed “normal”), it’s possible to 
> represent the partition-list (for a date) in a more condensed form: a list of 
> LighterPartition instances, all sharing a common StorageDescriptor whose 
> location points to the root directory. 
> We can go one better for the {{add_partitions()}} case: When adding all 
> partitions for a given date, the “normal” case affords us the ability to 
> specify the top-level date-directory, where sub-partitions can be inferred 
> from the HDFS directory-path.
> These extensions are hard to introduce at the metastore-level, since 
> partition-functions explicitly specify {{List<Partition>}} arguments. I 
> wonder if a {{PartitionSpec}} interface might help:
> {code}
> public PartitionSpec listPartitions(db_name, tbl_name, max_parts) throws ... 
> ; 
> public int add_partitions( PartitionSpec new_parts ) throws … ;
> {code}
> where the PartitionSpec looks like:
> {code}
> public interface PartitionSpec {
>         public List<Partition> getPartitions();
>         public List<String> getPartNames();
>         public Iterator<Partition> getPartitionIter();
>         public Iterator<String> getPartNameIter();
> }
> {code}
> For addPartitions(), an {{HDFSDirBasedPartitionSpec}} class could implement 
> {{PartitionSpec}}, store a top-level directory, and return Partition 
> instances from sub-directory names, while storing a single StorageDescriptor 
> for all of them.
> Similarly, list_partitions() could return a List<PartitionSpec>, where each 
> PartitionSpec corresponds to a set or partitions that can share a 
> StorageDescriptor.
> By exposing iterator semantics, neither the client nor the metastore need 
> instantiate all partitions at once. That should help with memory requirements.
> In case no smart grouping is possible, we could just fall back on a 
> {{DefaultPartitionSpec}} which composes {{List<Partition>}}, and is no worse 
> than status quo.
> PartitionSpec abstracts away how a set of partitions may be represented. A 
> tighter representation allows us to communicate metadata for a larger number 
> of Partitions, with less Thrift traffic.
> Given that Thrift doesn’t support polymorphism, we’d have to implement the 
> PartitionSpec as a Thrift Union of supported implementations. (We could 
> convert from the Thrift PartitionSpec to the appropriate Java PartitionSpec 
> sub-class.)
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

Reply via email to