[
https://issues.apache.org/jira/browse/HIVE-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553396#comment-16553396
]
Vihang Karajgaonkar commented on HIVE-19715:
--------------------------------------------
Attached the first version of the design proposal for the new API.
TLDR
The API reuses the existing {{PartitionSpec}} objects and makes some of the
fields in {{PartitionSpec}} optional. It also supports the following:
1. Projection list: a list of strings, each a dot-separated field name. For
example, clients that are interested only in partition locations can request
{{sd.location}}, and the result will include only the locations instead of the
full partition objects.
2. FilterSpec: provides different ways to filter the partitions of a given
table. The current version supports {{BY_NAMES}}, {{BY_VALUES}} and
{{BY_EXPR}}, although it's not clear whether there is value in providing
{{BY_VALUES}} filters.
3. Pagination: the API response contains a pagination token which clients can
send in subsequent requests to retrieve partitions in batches of configurable
size. The pagination token itself is a {{byte[]}} which the client doesn't
need to interpret. Internally, the server can store values in the token such
as the last {{PART_ID}} sent previously, the table modification stamp, etc.
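To make the projection and pagination ideas a bit more concrete, here is a
rough Python sketch. It is only an illustration, not the proposed Thrift
definitions: the in-memory {{PARTITIONS}} list, the functions {{project}},
{{encode_token}}, {{decode_token}}, {{fetch_page}} and {{fetch_all_locations}},
and the JSON/base64 token layout are all hypothetical. Filtering is left out
for brevity.

```python
import base64
import json

# Hypothetical in-memory stand-in for partition rows; not the real HMS schema.
PARTITIONS = [
    {
        "part_id": i,
        "values": [f"2018-07-{i:02d}"],
        "sd": {"location": f"/warehouse/t/ds=2018-07-{i:02d}", "numBuckets": 0},
        "parameters": {"numRows": str(i * 10)},
    }
    for i in range(1, 6)
]


def project(partition, fields):
    """Copy only the requested dot-separated fields, e.g. 'sd.location'."""
    out = {}
    for path in fields:
        keys = path.split(".")
        src, dst = partition, out
        for key in keys[:-1]:
            src = src[key]
            dst = dst.setdefault(key, {})
        dst[keys[-1]] = src[keys[-1]]
    return out


def encode_token(last_part_id):
    """Build an opaque byte token; only the server knows its layout (assumed)."""
    return base64.b64encode(json.dumps({"last_part_id": last_part_id}).encode())


def decode_token(token):
    """Recover the last part_id the server sent for this token."""
    return json.loads(base64.b64decode(token))["last_part_id"]


def fetch_page(fields, page_size, token=None):
    """Return (projected partitions, next token), with None on the last page."""
    last_id = decode_token(token) if token is not None else 0
    # PARTITIONS is already ordered by part_id, so resume after last_id.
    remaining = [p for p in PARTITIONS if p["part_id"] > last_id]
    page = remaining[:page_size]
    has_more = len(remaining) > page_size
    next_token = encode_token(page[-1]["part_id"]) if has_more else None
    return [project(p, fields) for p in page], next_token


def fetch_all_locations(page_size):
    """Client loop: pass the token back unchanged until the server returns None."""
    locations, token = [], None
    while True:
        page, token = fetch_page(["sd.location"], page_size, token)
        locations.extend(p["sd"]["location"] for p in page)
        if token is None:
            return locations
```

The client never inspects the token; it only hands it back. That leaves the
server free to change what it stores there (for example, adding a table
modification stamp) without breaking clients.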
Any thoughts or suggestions?
cc: [~alangates] [~thejas] [~tlipcon] [~akolb]
> Consolidated and flexible API for fetching partition metadata from HMS
> ----------------------------------------------------------------------
>
> Key: HIVE-19715
> URL: https://issues.apache.org/jira/browse/HIVE-19715
> Project: Hive
> Issue Type: New Feature
> Components: Standalone Metastore
> Reporter: Todd Lipcon
> Assignee: Vihang Karajgaonkar
> Priority: Major
> Attachments: HIVE-19715-design-doc.pdf
>
>
> Currently, the HMS thrift API exposes 17 different APIs for fetching
> partition-related information. There is somewhat of a combinatorial explosion
> going on, where each API has variants with and without "auth" info, by pspecs
> vs names, by filters, by exprs, etc. Having all of these separate APIs long
> term is a maintenance burden and also more confusing for consumers.
> Additionally, even with all of these APIs, there is a lack of granularity in
> fetching only the information needed for a particular use case. For example,
> in some use cases it may be beneficial to only fetch the partition locations
> without wasting effort fetching statistics, etc.
> This JIRA proposes that we add a new "one API to rule them all" for fetching
> partition info. The request and response would be encapsulated in structs.
> Some desirable properties:
> - the request should be able to specify which pieces of information are
> required (eg location, properties, etc)
> - in the case of partition parameters, the request should be able to do
> either whitelisting or blacklisting (eg to exclude large incremental column
> stats HLL dumped in there by Impala)
> - the request should optionally specify auth info (to encompass the
> "with_auth" variants)
> - the request should be able to designate the set of partitions to access
> through one of several different methods (eg "all", list<name>, expr,
> part_vals, etc)
> - the struct should be easily evolvable so that new pieces of info can be
> added
> - the response should be designed in such a way as to avoid transferring
> redundant information for common cases (eg simple "dictionary coding" of
> strings like parameter names, etc)
> - the API should support some form of pagination for tables with large
> partition counts