[
https://issues.apache.org/jira/browse/HIVE-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493797#comment-16493797
]
Todd Lipcon commented on HIVE-19715:
------------------------------------
bq. I also think we should at least deprecate the older APIs so that clients
can move to the newer APIs in the near future
Agreed, but I think realistically you can basically never remove a wire API
that has been in use for so long with so many integrations. Marking them as
deprecated is fine, but my guess is we're stuck with our existing set for many
more years even if the major ecosystem consumers move to the new one.
bq. handling of partition expressions - it would be great to avoid sending
serialized Java classes and UDFs via Thrift API.
Agreed with that. Defining a language-agnostic way to pass arbitrary
expression trees over Thrift is probably a larger project, though, so I think
we should leave it out of the initial scope for this API. Since the API would
be designed with several options for specifying the filtering criteria, it
shouldn't be too bad to add expression support as a "v2".
Perhaps for "V1" we could just support the most commonly used simple criteria
like equality or range predicates on individual columns. It seems that Presto
for example only uses that functionality anyway today
(https://github.com/prestodb/presto/blob/0.179/presto-hive/src/main/java/com/facebook/presto/hive/HivePartitionManager.java#L215)
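To make that concrete, here is roughly the shape a V1 filter could take; every
name below is hypothetical, not an existing HMS type:
{code:java}
import java.util.List;

// Hypothetical V1 filter: an implicit AND of simple per-column predicates.
// None of these types exist in the HMS Thrift IDL today; this is just a
// sketch of the expressiveness a first version might need.
enum PredicateOp { EQ, LT, LE, GT, GE }

class ColumnPredicate {
  String columnName;  // the partition column the predicate applies to
  PredicateOp op;     // comparison operator
  String value;       // literal, string-encoded like partition values are
}

class PartitionFilter {
  // Conjunction only; ORs and arbitrary expression trees deferred to "v2".
  List<ColumnPredicate> predicates;
}
{code}
That would cover equality and range predicates without putting any serialized
Java expressions on the wire.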
bq. One interesting side-effect of returning only a subset of interesting
fields of the partition objects is we probably will have to change the
partition fields to optional instead of required. This can create a
trickle-down effect all the way down to the database and I am not sure what
complications it can cause. Thoughts?
Moving Thrift fields from "required" to "optional" is allowed so long as you
don't try to send an object with a missing field to an old client that still
thinks it is "required"; that would cause the old client to fail. So, as long
as we ensure that the existing APIs continue to set all fields that are marked
"required" today, it would not be a wire-breaking change to downgrade them to
optional and fill them in conditionally in the new API.
As for "trickle down to the database", I'm not sure I follow. On any API calls
that today take a Partition object as an input (eg add_partition,
alter_partition, etc) we'd need to add validation to ensure that all fields
that are expected are set, whereas today we rely on Thrift to do so. But that
should be all, right?
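To sketch what I mean, using the isSet helpers that Thrift already generates
for the Java Partition class (the particular fields checked here are just
illustrative, not an exhaustive list):
{code:java}
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.api.Partition;

// Sketch of the server-side check that would replace Thrift's automatic
// enforcement once fields are downgraded from "required" to "optional".
class PartitionWriteValidator {
  static void validate(Partition p) throws MetaException {
    if (!p.isSetValues() || !p.isSetDbName()
        || !p.isSetTableName() || !p.isSetSd()) {
      throw new MetaException(
          "Partition passed to add/alter_partition is missing required fields");
    }
  }
}
{code}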
bq. Are you proposing a Thrift version of Java interning?
Somewhat, yea -- or just a "symbol table" or "string table" if you will.
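Roughly like this on the sender side; purely illustrative, not an existing HMS
structure:
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative "string table" for a response: each distinct string crosses
// the wire once, and per-partition fields carry small integer indexes into
// the table instead of repeating the string.
class StringTable {
  private final List<String> strings = new ArrayList<>();
  private final Map<String, Integer> indexOf = new HashMap<>();

  // Returns the table index for s, adding it on first use.
  int intern(String s) {
    Integer i = indexOf.get(s);
    if (i == null) {
      i = strings.size();
      strings.add(s);
      indexOf.put(s, i);
    }
    return i;
  }

  List<String> contents() { return strings; }
}
{code}
That way a parameter key like "numFiles" that appears on every one of 100k
partitions would be sent once, with each partition carrying a small integer
index instead.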
bq. Should we also have a unified way to send list of locations as a path trie
(or some other compressed form)?
That's an interesting idea, though it certainly increases complexity since
clients would also need to decode the trie. Do you think we have end-user
clients who could consume the strings in their trie-encoded form, or would we
have to just "decompress" them within the Metastore Client code and provide the
end user with complete Strings? If the latter, I'm not sure if there would be a
big gain vs compressing the whole thrift object on the wire with something like
LZ4 or Snappy.
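For comparison, even a flat "front coding" of a sorted location list, where
each path is stored as the length of the prefix it shares with the previous
path plus the new suffix, would capture most of the redundancy. A hypothetical
sketch of the decode step a client would have to run:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Front coding, one possible "other compressed form": locations are sorted,
// and each entry is (length of prefix shared with previous path, suffix).
// Field names and wire layout here are hypothetical.
class FrontCodedPaths {
  static List<String> decode(List<Integer> sharedPrefixLens, List<String> suffixes) {
    List<String> paths = new ArrayList<>();
    String prev = "";
    for (int i = 0; i < suffixes.size(); i++) {
      String path = prev.substring(0, sharedPrefixLens.get(i)) + suffixes.get(i);
      paths.add(path);
      prev = path;
    }
    return paths;
  }
}
{code}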
> Consolidated and flexible API for fetching partition metadata from HMS
> ----------------------------------------------------------------------
>
> Key: HIVE-19715
> URL: https://issues.apache.org/jira/browse/HIVE-19715
> Project: Hive
> Issue Type: New Feature
> Components: Standalone Metastore
> Reporter: Todd Lipcon
> Assignee: Vihang Karajgaonkar
> Priority: Major
>
> Currently, the HMS thrift API exposes 17 different APIs for fetching
> partition-related information. There is somewhat of a combinatorial explosion
> going on, where each API has variants with and without "auth" info, by pspecs
> vs names, by filters, by exprs, etc. Having all of these separate APIs long
> term is a maintenance burden and makes the interface more confusing for
> consumers.
> Additionally, even with all of these APIs, there is a lack of granularity in
> fetching only the information needed for a particular use case. For example,
> in some use cases it may be beneficial to only fetch the partition locations
> without wasting effort fetching statistics, etc.
> This JIRA proposes that we add a new "one API to rule them all" for fetching
> partition info. The request and response would be encapsulated in structs; a
> rough sketch follows the list of properties below.
> Some desirable properties:
> - the request should be able to specify which pieces of information are
> required (eg location, properties, etc)
> - in the case of partition parameters, the request should be able to do
> either whitelisting or blacklisting (eg to exclude the large incremental
> column stats HLLs that Impala dumps in there)
> - the request should optionally specify auth info (to encompass the
> "with_auth" variants)
> - the request should be able to designate the set of partitions to access
> through one of several different methods (eg "all", list<name>, expr,
> part_vals, etc)
> - the struct should be easily evolvable so that new pieces of info can be
> added
> - the response should be designed in such a way as to avoid transferring
> redundant information for common cases (eg simple "dictionary coding" of
> strings like parameter names, etc)
> - the API should support some form of pagination for tables with large
> partition counts
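> For illustration, a request combining the properties above might look
> roughly like this (every name below is hypothetical, not a committed design):
> {code:java}
> import java.util.List;
>
> // Hypothetical consolidated request; all names are made up for illustration.
> class GetPartitionsRequest {
>   String dbName;
>   String tableName;
>   List<String> requestedFields;   // which pieces of info to return, eg "location"
>   List<String> includeParamKeys;  // whitelist of partition parameter keys
>   List<String> excludeParamKeys;  // blacklist, eg Impala's incremental stats HLLs
>   String authUser;                // optional auth info ("with_auth" variants)
>   List<String> partitionNames;    // one of several ways to designate partitions
>   int maxParts;                   // pagination: max results per call
>   String continuationToken;       // pagination: opaque token from prior response
> }
> {code}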
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)