[
https://issues.apache.org/jira/browse/HIVE-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568590#comment-16568590
]
Vihang Karajgaonkar commented on HIVE-19715:
--------------------------------------------
While working on this I realized a few things that could change the design. By
default, Thrift fields have "default requiredness"
[https://thrift.apache.org/docs/idl#field-requiredness], which is a hybrid of
{{optional}} and {{required}}. On the write path, Thrift attempts to write such
fields whenever possible (null fields cannot be written, IIUC), and on the read
path the reader always checks whether the field is set. This is exactly the
behavior we want, and fortunately the Partition thrift definition uses only
default requiredness or {{optional}}, which works well for partially filled
partitions. So in theory I could simply return a List<Partition> from this API,
but I think using PartitionSpec still makes a lot of sense since it groups the
partitions by {{table location, fieldSchema, deserializer class}}. In the case
of non-standard partition locations, there is no harm in grouping them
together, especially when there are many such non-standard partitions.
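The grouping idea can be sketched as follows. {{PartitionLike}} and its fields here are simplified, hypothetical stand-ins for the real Thrift {{Partition}}/{{PartitionSpec}} types, just to show how partitions sharing a parent location, schema, and deserializer would fold into one group:

```java
import java.util.*;
import java.util.stream.*;

public class PartitionGrouping {
    // Simplified stand-in for the Thrift Partition struct (illustrative only).
    public static class PartitionLike {
        final String location;     // storage location of this partition
        final String schemaHash;   // proxy for the fieldSchema list
        final String deserializer; // deserializer class name

        public PartitionLike(String location, String schemaHash, String deserializer) {
            this.location = location;
            this.schemaHash = schemaHash;
            this.deserializer = deserializer;
        }

        // Partitions sharing the same parent location, schema, and deserializer
        // can be folded into one PartitionSpec-style group.
        String groupKey() {
            String parent = location.substring(0, location.lastIndexOf('/'));
            return parent + "|" + schemaHash + "|" + deserializer;
        }
    }

    public static Map<String, List<PartitionLike>> group(List<PartitionLike> parts) {
        return parts.stream().collect(Collectors.groupingBy(PartitionLike::groupKey));
    }

    public static void main(String[] args) {
        List<PartitionLike> parts = Arrays.asList(
            new PartitionLike("/warehouse/t/p=1", "s1", "LazySimpleSerDe"),
            new PartitionLike("/warehouse/t/p=2", "s1", "LazySimpleSerDe"),
            new PartitionLike("/elsewhere/p=3", "s1", "LazySimpleSerDe"));
        // Two groups: the table-rooted partitions and the non-standard location.
        System.out.println(group(parts).size()); // prints 2
    }
}
```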
I am planning to use {{PropertyUtils}} from the {{commons-beanutils}} package,
which is already on the metastore classpath via the {{apache-commons}}
dependency. It provides the {{setNestedProperty}} method, which can be used to
set the fields. All the fields defined in Thrift have setter methods, so this
should not cause any problems.
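{{PropertyUtils.setNestedProperty(bean, "sd.location", value)}} walks the getter chain for every path segment but the last, then invokes the final setter. To keep the example self-contained, here is a minimal reflection-based sketch of that same idea; the {{Partition}}/{{StorageDescriptor}} beans below are hypothetical stand-ins for the Thrift-generated classes, not the real ones:

```java
import java.lang.reflect.Method;

public class NestedSetterDemo {
    // Minimal stand-ins for the Thrift-generated beans (illustrative only).
    public static class StorageDescriptor {
        private String location;
        public String getLocation() { return location; }
        public void setLocation(String l) { location = l; }
    }
    public static class Partition {
        private StorageDescriptor sd = new StorageDescriptor();
        public StorageDescriptor getSd() { return sd; }
        public void setSd(StorageDescriptor s) { sd = s; }
    }

    // Rough equivalent of PropertyUtils.setNestedProperty(bean, path, value):
    // follow the getters for all segments but the last, then call the setter.
    public static void setNested(Object bean, String path, Object value) throws Exception {
        String[] parts = path.split("\\.");
        for (int i = 0; i < parts.length - 1; i++) {
            Method getter = bean.getClass().getMethod("get" + capitalize(parts[i]));
            bean = getter.invoke(bean);
        }
        String last = parts[parts.length - 1];
        Method setter = bean.getClass().getMethod("set" + capitalize(last), value.getClass());
        setter.invoke(bean, value);
    }

    private static String capitalize(String s) {
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) throws Exception {
        Partition p = new Partition();
        setNested(p, "sd.location", "/warehouse/tbl/p=1");
        System.out.println(p.getSd().getLocation()); // prints /warehouse/tbl/p=1
    }
}
```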
For setting the projected fields: in the JDO case we cannot set multi-valued
fields in the {{setResult}} clause, which is a JDO limitation, so the JDO
version of the API will fall back to retrieving full partitions. The directSQL
version of the API, however, should be able to parse and set multi-valued
fields as it does today. I am currently looking at the directSQL implementation
of setting partition fields, trying to come up with a more maintainable way to
selectively fire the right queries based on the projection field list instead
of introducing a bunch of if/else or case statements in that code. I am
thinking of creating a PartitionFieldParser class that emits the right queries
for a given list of fields. We will also have to take care of optimizing the
field list: it should remove redundant fields, e.g. if {{sd}} is present we can
safely remove the redundant {{sd.location}} or
{{sd.serdeInfo.serializationClass}}. Similarly, if all the nested fields of
{{sd}} are present individually, we can combine them into the single field
{{sd}}. I am currently treating these as optional improvements which I will fix
later as needed.
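The redundant-field removal could look something like the sketch below; the method name and dotted-path convention are assumptions for illustration, not the eventual implementation:

```java
import java.util.*;

public class ProjectionOptimizer {
    // Drop any field whose ancestor is already projected, e.g. remove
    // "sd.location" and "sd.serdeInfo.serializationClass" when "sd" is present.
    public static List<String> removeRedundant(Collection<String> fields) {
        // Sorted order guarantees an ancestor ("sd") is visited before any of
        // its descendants ("sd.location", "sd.serdeInfo...").
        Set<String> sorted = new TreeSet<>(fields);
        List<String> result = new ArrayList<>();
        for (String f : sorted) {
            boolean covered = false;
            for (String kept : result) {
                if (f.startsWith(kept + ".")) { covered = true; break; }
            }
            if (!covered) result.add(f);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(removeRedundant(Arrays.asList(
            "sd", "sd.location", "sd.serdeInfo.serializationClass", "parameters")));
        // prints [parameters, sd]
    }
}
```

The reverse optimization (collapsing a complete set of nested fields back into {{sd}}) would need a catalog of which children make up each struct, so it is left out of this sketch.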
I plan to divide the work into sub-tasks since each of these could be a
considerable code change.
1. Expose thrift API with the support for projected fields
2. Add support for filters
3. Add support for pagination
Will update the design doc based on the above modifications once I am close to
completing sub-task 1, in case there are more puzzles to solve.
> Consolidated and flexible API for fetching partition metadata from HMS
> ----------------------------------------------------------------------
>
> Key: HIVE-19715
> URL: https://issues.apache.org/jira/browse/HIVE-19715
> Project: Hive
> Issue Type: New Feature
> Components: Standalone Metastore
> Reporter: Todd Lipcon
> Assignee: Vihang Karajgaonkar
> Priority: Major
> Attachments: HIVE-19715-design-doc.pdf
>
>
> Currently, the HMS thrift API exposes 17 different APIs for fetching
> partition-related information. There is somewhat of a combinatorial explosion
> going on, where each API has variants with and without "auth" info, by pspecs
> vs names, by filters, by exprs, etc. Having all of these separate APIs long
> term is a maintenance burden and also more confusing for consumers.
> Additionally, even with all of these APIs, there is a lack of granularity in
> fetching only the information needed for a particular use case. For example,
> in some use cases it may be beneficial to only fetch the partition locations
> without wasting effort fetching statistics, etc.
> This JIRA proposes that we add a new "one API to rule them all" for fetching
> partition info. The request and response would be encapsulated in structs.
> Some desirable properties:
> - the request should be able to specify which pieces of information are
> required (eg location, properties, etc)
> - in the case of partition parameters, the request should be able to do
> either whitelisting or blacklisting (eg to exclude large incremental column
> stats HLL dumped in there by Impala)
> - the request should optionally specify auth info (to encompass the
> "with_auth" variants)
> - the request should be able to designate the set of partitions to access
> through one of several different methods (eg "all", list<name>, expr,
> part_vals, etc)
> - the struct should be easily evolvable so that new pieces of info can be
> added
> - the response should be designed in such a way as to avoid transferring
> redundant information for common cases (eg simple "dictionary coding" of
> strings like parameter names, etc)
> - the API should support some form of pagination for tables with large
> partition counts