[
https://issues.apache.org/jira/browse/HIVE-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568590#comment-16568590
]
Vihang Karajgaonkar commented on HIVE-19715:
--------------------------------------------
While working on this I realized a few things that could change the design. By
default, Thrift fields have "default requiredness"
[https://thrift.apache.org/docs/idl#field-requiredness], which is a hybrid of
{{optional}} and {{required}}. On the write path, Thrift attempts to write such
fields whenever possible (null fields cannot be written, IIUC), and on the read
path the reader always checks whether the field is set. This is exactly the
behavior we want, and fortunately the Partition thrift definition uses only
default requiredness or {{optional}}, which works well for partially filled
partitions. So in theory I could simply return a List<Partition> from this API,
but I think using PartitionSpec still makes a lot of sense since it groups the
partitions by {{table location, fieldSchema, deserializer class}}. In the case
of non-standard partition locations, there is no harm in grouping them
together, especially when there are many such non-standard partitions.
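The grouping idea can be sketched as follows. {{PartitionLike}} and its fields here are simplified, hypothetical stand-ins for the real Thrift {{Partition}}/{{PartitionSpec}} types, just to show how partitions sharing a parent location, schema, and deserializer would fold into one group:

```java
import java.util.*;
import java.util.stream.*;

public class PartitionGrouping {
    // Simplified stand-in for the Thrift Partition struct (illustrative only).
    public static class PartitionLike {
        final String location;     // storage location of this partition
        final String schemaHash;   // proxy for the fieldSchema list
        final String deserializer; // deserializer class name

        public PartitionLike(String location, String schemaHash, String deserializer) {
            this.location = location;
            this.schemaHash = schemaHash;
            this.deserializer = deserializer;
        }

        // Partitions sharing the same parent location, schema, and deserializer
        // can be folded into one PartitionSpec-style group.
        String groupKey() {
            String parent = location.substring(0, location.lastIndexOf('/'));
            return parent + "|" + schemaHash + "|" + deserializer;
        }
    }

    public static Map<String, List<PartitionLike>> group(List<PartitionLike> parts) {
        return parts.stream().collect(Collectors.groupingBy(PartitionLike::groupKey));
    }

    public static void main(String[] args) {
        List<PartitionLike> parts = Arrays.asList(
            new PartitionLike("/warehouse/t/p=1", "s1", "LazySimpleSerDe"),
            new PartitionLike("/warehouse/t/p=2", "s1", "LazySimpleSerDe"),
            new PartitionLike("/elsewhere/p=3", "s1", "LazySimpleSerDe"));
        // Two groups: the table-rooted partitions and the non-standard location.
        System.out.println(group(parts).size()); // prints 2
    }
}
```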
I am planning to use {{PropertyUtils}} from the {{commons-beanutils}} package,
which is already on the metastore classpath via the {{apache-commons}}
dependency. It provides the {{setNestedProperty}} method, which can be used to
set the fields. All the fields defined in Thrift have setter methods, so this
should not cause any problems.
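{{PropertyUtils.setNestedProperty(bean, "sd.location", value)}} walks the getter chain for every path segment but the last, then invokes the final setter. To keep the example self-contained, here is a minimal reflection-based sketch of that same idea; the {{Partition}}/{{StorageDescriptor}} beans below are hypothetical stand-ins for the Thrift-generated classes, not the real ones:

```java
import java.lang.reflect.Method;

public class NestedSetterDemo {
    // Minimal stand-ins for the Thrift-generated beans (illustrative only).
    public static class StorageDescriptor {
        private String location;
        public String getLocation() { return location; }
        public void setLocation(String l) { location = l; }
    }
    public static class Partition {
        private StorageDescriptor sd = new StorageDescriptor();
        public StorageDescriptor getSd() { return sd; }
        public void setSd(StorageDescriptor s) { sd = s; }
    }

    // Rough equivalent of PropertyUtils.setNestedProperty(bean, path, value):
    // follow the getters for all segments but the last, then call the setter.
    public static void setNested(Object bean, String path, Object value) throws Exception {
        String[] parts = path.split("\\.");
        for (int i = 0; i < parts.length - 1; i++) {
            Method getter = bean.getClass().getMethod("get" + capitalize(parts[i]));
            bean = getter.invoke(bean);
        }
        String last = parts[parts.length - 1];
        Method setter = bean.getClass().getMethod("set" + capitalize(last), value.getClass());
        setter.invoke(bean, value);
    }

    private static String capitalize(String s) {
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) throws Exception {
        Partition p = new Partition();
        setNested(p, "sd.location", "/warehouse/tbl/p=1");
        System.out.println(p.getSd().getLocation()); // prints /warehouse/tbl/p=1
    }
}
```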
For setting the projected fields: in the JDO case we cannot set multi-valued
fields in the {{setResult}} clause, which is a JDO limitation, so the JDO
version of the API will fall back to retrieving full partitions. The directSQL
version of the API, however, should be able to parse and set multi-valued
fields as it does today. I am currently looking at the directSQL implementation
of setting partition fields, trying to come up with a more maintainable way to
selectively fire the right queries based on the projection field list instead
of introducing a bunch of if/else or case statements in that code. I am
thinking of creating a PartitionFieldParser class that emits the right queries
for a given list of fields. We will also have to take care of optimizing the
field list: it should remove redundant fields, e.g. if {{sd}} is present we can
safely remove the redundant {{sd.location}} or
{{sd.serdeInfo.serializationClass}}. Similarly, if all the nested fields of
{{sd}} are present individually, we can combine them into the single field
{{sd}}. I am currently treating these as optional improvements which I will fix
later as needed.
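The redundant-field removal could look something like the sketch below; the method name and dotted-path convention are assumptions for illustration, not the eventual implementation:

```java
import java.util.*;

public class ProjectionOptimizer {
    // Drop any field whose ancestor is already projected, e.g. remove
    // "sd.location" and "sd.serdeInfo.serializationClass" when "sd" is present.
    public static List<String> removeRedundant(Collection<String> fields) {
        // Sorted order guarantees an ancestor ("sd") is visited before any of
        // its descendants ("sd.location", "sd.serdeInfo...").
        Set<String> sorted = new TreeSet<>(fields);
        List<String> result = new ArrayList<>();
        for (String f : sorted) {
            boolean covered = false;
            for (String kept : result) {
                if (f.startsWith(kept + ".")) { covered = true; break; }
            }
            if (!covered) result.add(f);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(removeRedundant(Arrays.asList(
            "sd", "sd.location", "sd.serdeInfo.serializationClass", "parameters")));
        // prints [parameters, sd]
    }
}
```

The reverse optimization (collapsing a complete set of nested fields back into {{sd}}) would need a catalog of which children make up each struct, so it is left out of this sketch.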
I plan to divide the work into sub-tasks since each of these could be a
considerable code change.
1. Expose thrift API with the support for projected fields
2. Add support for filters
3. Add support for pagination
Will update the design doc based on the above modifications once I am close to
completing sub-task 1, in case there are more puzzles to solve.
> Consolidated and flexible API for fetching partition metadata from HMS
> ----------------------------------------------------------------------
>
> Key: HIVE-19715
> URL: https://issues.apache.org/jira/browse/HIVE-19715
> Project: Hive
> Issue Type: New Feature
> Components: Standalone Metastore
> Reporter: Todd Lipcon
> Assignee: Vihang Karajgaonkar
> Priority: Major
> Attachments: HIVE-19715-design-doc.pdf
>
>
> Currently, the HMS thrift API exposes 17 different APIs for fetching
> partition-related information. There is somewhat of a combinatorial explosion
> going on, where each API has variants with and without "auth" info, by pspecs
> vs names, by filters, by exprs, etc. Having all of these separate APIs long
> term is a maintenance burden and also more confusing for consumers.
> Additionally, even with all of these APIs, there is a lack of granularity in
> fetching only the information needed for a particular use case. For example,
> in some use cases it may be beneficial to only fetch the partition locations
> without wasting effort fetching statistics, etc.
> This JIRA proposes that we add a new "one API to rule them all" for fetching
> partition info. The request and response would be encapsulated in structs.
> Some desirable properties:
> - the request should be able to specify which pieces of information are
> required (eg location, properties, etc)
> - in the case of partition parameters, the request should be able to do
> either whitelisting or blacklisting (eg to exclude large incremental column
> stats HLL dumped in there by Impala)
> - the request should optionally specify auth info (to encompass the
> "with_auth" variants)
> - the request should be able to designate the set of partitions to access
> through one of several different methods (eg "all", list<name>, expr,
> part_vals, etc)
> - the struct should be easily evolvable so that new pieces of info can be
> added
> - the response should be designed in such a way as to avoid transferring
> redundant information for common cases (eg simple "dictionary coding" of
> strings like parameter names, etc)
> - the API should support some form of pagination for tables with large
> partition counts