Re: How to use `engine` introduced by HIVE-22046

Butao Zhang Sun, 07 Jan 2024 20:37:31 -0800

Hi, dev
Bumping this thread!
I just filed a ticket to track and fix the incompatibility in the HMS column
stats Thrift API. I think we can fix this at its root on the HMS side, so that
third-party components will no longer suffer from this issue.
https://issues.apache.org/jira/browse/HIVE-27984



Thanks,
Butao Zhang
---- Replied Message ----
| From | Okumin<[email protected]> |
| Date | 8/20/2023 18:37 |
| To | <[email protected]> |
| Subject | Re: How to use `engine` introduced by HIVE-22046 |
Hi Butao,

Thanks for sharing your PR! I didn't find trinodb/trino-hive-apache
or trinodb/hive-thrift.

As mentioned in the PR, the current Thrift definitions might not be the
final version, but it sounds reasonable to give information to external
products now that we have versioned Hive 4 beta. I'm curious whether anyone
knows why we give different engine names to Hive and Impala and what the
recommended options are.

Thanks,
Okumin



On Fri, Aug 11, 2023 at 10:39 AM Butao Zhang <[email protected]> wrote:

Hi, Okumin


I have encountered this issue before, and 'validWriteIdList' is another
incompatible parameter. I have submitted a PR in the trino-hive-apache repo,
and you can refer to https://github.com/trinodb/trino-hive-apache/pull/43.
IIUC, the 'engine' parameter is used to differentiate between stats
produced by different engines (Hive, Spark, Presto, Impala), but it seems that
the downstream engines do not want to adopt and implement the new 'engine'
parameter.
At present, if an engine (e.g. Trino) uses a customized Thrift API to
interact with HMS, it must change its Thrift file to match the Thrift
definition of HMS.
BTW, maybe we can change the HMS Thrift file to make the 'engine' parameter
optional, so that other customized Thrift clients will not have
compatibility issues.

Thanks,

Butao Zhang

---- Replied Message ----
| From | Okumin<[email protected]> |
| Date | 8/10/2023 23:41 |
| To | <[email protected]> |
| Subject | How to use `engine` introduced by HIVE-22046 |
Hi Hive developers,

I noticed that HIVE-22046 introduced an incompatibility in the Metastore APIs
while I was testing integration between Hive 4 and other software. If I
understand correctly, clients are now required to specify the engine name
when they get or update column statistics.

- https://issues.apache.org/jira/browse/HIVE-22046
- https://github.com/apache/hive/pull/741

For example, Trino has a feature to use column stats and it fails. Note
that I am not 100% sure about Trino's implementation or behavior.

```
trino> create table hive.default.test_trino (id int);
Query 20230810_152236_00004_t9n6h failed: Required field 'engine' is unset!
Struct:TableStatsRequest(dbName:default, tblName:test_trino, colNames:[id],
engine:null)
```
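
To make the failure mode concrete, here is a minimal sketch of the kind of
required-field check that Thrift-generated Java code performs. The class
below is a hypothetical stand-in, not the real generated TableStatsRequest,
and it throws IllegalStateException where the generated code would throw a
Thrift protocol exception.

```java
import java.util.List;

// Hypothetical, simplified stand-in for a Thrift-generated request struct.
// It is NOT the real Metastore API class; only the field names are taken
// from the error message above.
final class TableStatsRequestSketch {
    final String dbName;
    final String tblName;
    final List<String> colNames;
    final String engine; // a client built against the old definition never sets this

    TableStatsRequestSketch(String dbName, String tblName,
                            List<String> colNames, String engine) {
        this.dbName = dbName;
        this.tblName = tblName;
        this.colNames = colNames;
        this.engine = engine;
    }

    // Rejects the request when the required 'engine' field is missing.
    void validate() {
        if (engine == null) {
            throw new IllegalStateException(
                "Required field 'engine' is unset! Struct:" + this);
        }
    }

    @Override
    public String toString() {
        return "TableStatsRequest(dbName:" + dbName + ", tblName:" + tblName
            + ", colNames:" + colNames + ", engine:" + engine + ")";
    }
}
```

A request built without an engine name fails validation with a message of
the same shape as the one Trino reports, while the same request with
engine set to "hive" passes.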

I have two questions about this feature.

(1) Should every engine use its own unique engine name?

I guess some software can store or use stats compatible with Hive. I wonder
if it can reuse engine=hive in that case, or should use a different name
like engine=trino.

I see Impala gives a unique engine name to the metastore. At a glance,
Spark does not appear to use Hive's column stats directly.

- https://issues.apache.org/jira/browse/IMPALA-8842
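
To illustrate why the choice of name matters, here is a toy sketch that
assumes stats are stored and looked up per engine name. The class, its
methods, and the in-memory map are hypothetical and only model the idea;
they are not how the metastore is implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory model of engine-scoped column stats.
// Assumption: a lookup only sees stats written under the same engine name.
final class EngineScopedStatsSketch {
    private final Map<String, Long> distinctCounts = new HashMap<>();

    // Builds a composite key from table, column, and engine name.
    private static String key(String table, String column, String engine) {
        return table + "/" + column + "/" + engine;
    }

    // Records a distinct-value count for one column under one engine name.
    void put(String table, String column, String engine, long ndv) {
        distinctCounts.put(key(table, column, engine), ndv);
    }

    // Returns the count only if it was written under the same engine name.
    Optional<Long> get(String table, String column, String engine) {
        return Optional.ofNullable(distinctCounts.get(key(table, column, engine)));
    }
}
```

Under that assumption, stats written with engine=hive are invisible to a
reader that asks with engine=trino, and visible to one that reuses
engine=hive.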

(2) Should Hive Metastore use engine=hive as a default value?

If other compatible software can reuse engine=hive, it could be an option
to accept requests with the old format assuming its engine is "hive" for
compatibility. Or should they explicitly specify engine=hive when using
Hive 4?
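
The first option could look roughly like the sketch below. The helper name
and the choice of "hive" as the fallback are assumptions for illustration,
not existing Metastore code.

```java
// Hypothetical helper: treats a request without an engine name as if it
// came from Hive, so clients using the old request format keep working.
final class EngineDefaults {
    // Assumed default; whether "hive" is the right fallback is the open question.
    static final String DEFAULT_ENGINE = "hive";

    // Returns the requested engine, or the default when none was supplied.
    static String resolveEngine(String requested) {
        if (requested == null || requested.isEmpty()) {
            return DEFAULT_ENGINE;
        }
        return requested;
    }
}
```

With this, a request from an old client resolves to "hive", while a client
that explicitly sends engine=impala keeps its own name.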

Regards,
Okumin
