[ https://issues.apache.org/jira/browse/HIVE-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493797#comment-16493797 ]

Todd Lipcon commented on HIVE-19715:
------------------------------------

bq. I also think we should at-least deprecate the older APIs so that clients 
can move to the newer APIs in the near future

Agreed, but I think realistically you can basically never remove a wire API 
that has been in use for so long with so many integrations. Marking it as 
deprecated is fine but my guess is we're stuck with our existing set for many 
more years even if the major ecosystem consumers move to the new one.

bq. handling of partition expressions - it would be great to avoid sending 
serialized Java classes and UDFs via Thrift API.

Agreed with that. Defining a language-agnostic way to pass arbitrary 
expression trees over Thrift is probably a larger project, though, so I think 
we should leave that out of the initial scope for this API. Since the API 
would be designed such that there are several options for specifying the 
filtering criteria, it shouldn't be too bad to add it in as a "v2".

Perhaps for "V1" we could just support the most commonly used simple criteria, 
like equality or range predicates on individual columns. It seems that Presto, 
for example, only uses that functionality today 
(https://github.com/prestodb/presto/blob/0.179/presto-hive/src/main/java/com/facebook/presto/hive/HivePartitionManager.java#L215).
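
To make that concrete, here is roughly the shape I'd imagine for the V1 
criteria. This is just a sketch; none of these names exist in the metastore 
IDL today:

{code}
// Hypothetical "V1" filter: a conjunction of simple per-column predicates.
enum SimplePredicateOp {
  EQ = 1,
  LT = 2,
  LE = 3,
  GT = 4,
  GE = 5
}

struct SimpleColumnPredicate {
  1: required string columnName,
  2: required SimplePredicateOp op,
  3: required string value            // partition values travel as strings
}

struct SimplePartitionFilter {
  // Implicit AND; a range predicate is just a GT/GE plus a LT/LE on the
  // same column.
  1: required list<SimpleColumnPredicate> predicates
}
{code}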

bq. One interesting side-effect of returning only subset of interesting fields 
of the partition objects is we probably will have to change the partition 
fields as optional instead of the required. This can create a trickle down 
effect all the way down to the database and I am not sure what complications 
can it cause. Thoughts?

Moving Thrift fields from "required" to "optional" is allowed so long as you 
don't send an object with the field missing to an old client that still 
thinks it is "required"; that would cause the old client to fail. So, as long 
as we ensure that the existing APIs continue to set all fields that are marked 
"required" today, it would not be a wire-breaking change to downgrade them to 
optional and populate them conditionally in the new API.
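
In IDL terms, the move looks something like this (a toy struct for 
illustration, not the real Partition definition):

{code}
// Old definition: readers reject any message that is missing this field.
struct PartitionV0 {
  1: required string location
}

// New definition: same field id and type, so the bytes on the wire are
// identical whenever the field is set. Old clients only fail if we actually
// omit the field when talking to them.
struct PartitionV1 {
  1: optional string location
}
{code}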

As for "trickle down to the database", I'm not sure I follow. On any API calls 
that today take a Partition object as an input (eg add_partition, 
alter_partition, etc) we'd need to add validation to ensure that all fields 
that are expected are set, whereas today we rely on Thrift to do so. But that 
should be all, right?

bq. Are you proposing Thrift version of Java interning?

Somewhat, yea -- or just a "symbol table" or "string table" if you will.
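
Sketching it in Thrift (hypothetical names), the idea would be something like:

{code}
// Hypothetical dictionary coding: each distinct string is shipped once, and
// everything else refers to it by index into the shared table.
struct StringTable {
  1: required list<string> strings
}

struct PartitionRef {
  1: required list<i32> valueIds,         // indexes into StringTable.strings
  2: required i32 locationId,
  3: required map<i32, i32> parameterIds  // name index -> value index
}

struct GetPartitionsResponseSketch {
  1: required StringTable stringTable,
  2: required list<PartitionRef> partitions
}
{code}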

bq. Should we also have a unified way to send list of locations as a path trie 
(or some other compressed form)?

That's an interesting idea, though it certainly increases complexity since 
clients would also need to decode the trie. Do you think we have end-user 
clients who could consume the strings in their trie-encoded form, or would we 
have to just "decompress" them within the Metastore Client code and provide the 
end user with complete Strings? If the latter, I'm not sure if there would be a 
big gain vs compressing the whole thrift object on the wire with something like 
LZ4 or Snappy.
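
For reference, recursive structs are awkward in Thrift, so the trie would 
probably need to be flattened, along these lines (hypothetical):

{code}
// Hypothetical flattened trie: each node names its parent by index, so no
// recursive struct definition is needed. A client reconstructs a location
// by walking from a leaf up to the root and joining the components.
struct PathTrieNode {
  1: required string component,      // one path segment, e.g. "year=2018"
  2: required i32 parentIndex        // -1 marks the root
}

struct PathTrie {
  1: required list<PathTrieNode> nodes,
  2: required list<i32> leafIndexes  // one leaf per partition location
}
{code}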


> Consolidated and flexible API for fetching partition metadata from HMS
> ----------------------------------------------------------------------
>
>                 Key: HIVE-19715
>                 URL: https://issues.apache.org/jira/browse/HIVE-19715
>             Project: Hive
>          Issue Type: New Feature
>          Components: Standalone Metastore
>            Reporter: Todd Lipcon
>            Assignee: Vihang Karajgaonkar
>            Priority: Major
>
> Currently, the HMS thrift API exposes 17 different APIs for fetching 
> partition-related information. There is somewhat of a combinatorial explosion 
> going on, where each API has variants with and without "auth" info, by 
> partition specs vs names, by filters, by exprs, etc. Having all of these 
> separate APIs long 
> term is a maintenance burden and also more confusing for consumers.
>
> Additionally, even with all of these APIs, there is a lack of granularity in 
> fetching only the information needed for a particular use case. For example, 
> in some use cases it may be beneficial to only fetch the partition locations 
> without wasting effort fetching statistics, etc.
>
> This JIRA proposes that we add a new "one API to rule them all" for fetching 
> partition info. The request and response would be encapsulated in structs (a 
> rough sketch follows the list below). Some desirable properties:
> - the request should be able to specify which pieces of information are 
> required (eg location, properties, etc)
> - in the case of partition parameters, the request should be able to do 
> either whitelisting or blacklisting (eg to exclude large incremental column 
> stats HLL dumped in there by Impala)
> - the request should optionally specify auth info (to encompass the 
> "with_auth" variants)
> - the request should be able to designate the set of partitions to access 
> through one of several different methods (eg "all", list<name>, expr, 
> part_vals, etc) 
> - the struct should be easily evolvable so that new pieces of info can be 
> added
> - the response should be designed in such a way as to avoid transferring 
> redundant information for common cases (eg simple "dictionary coding" of 
> strings like parameter names, etc)
> - the API should support some form of pagination for tables with large 
> partition counts
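> A very rough sketch of the request shape, just to make the properties above 
> concrete (every name here is hypothetical):
> {code}
> struct GetPartitionsRequest {
>   1: required string dbName,
>   2: required string tableName,
>   3: optional list<string> wantedFields,       // e.g. "location"
>   4: optional list<string> excludedParamKeys,  // blacklist large parameters
>   5: optional string authUserName,             // covers "with_auth" variants
>   6: optional list<string> partitionNames,     // one of several selectors
>   7: optional i32 maxParts,                    // pagination: page size
>   8: optional string paginationToken           // pagination: resume point
> }
> {code}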



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
