Hi, dev

Bumping this thread! I just filed a ticket to track and fix the incompatibility issues in the HMS column stats Thrift API. I think we can fix this at its root on the HMS side, so that third-party components no longer suffer from this issue.
https://issues.apache.org/jira/browse/HIVE-27984

Thanks,
Butao Zhang

---- Replied Message ----
| From | Okumin<m...@okumin.com> |
| Date | 8/20/2023 18:37 |
| To | <dev@hive.apache.org> |
| Subject | Re: How to use `engine` introduced by HIVE-22046 |

Hi Butao,

Thanks for sharing your PR! I didn't find trinodb/trino-hive-apache or trinodb/hive-thrift. As mentioned in the PR, the current Thrift definitions might not be the final version, but it sounds reasonable to give information to external products since we versioned Hive 4 beta.

I'm curious whether anyone knows why we give different engine names to Hive and Impala, and what the recommended options are.

Thanks,
Okumin

On Fri, Aug 11, 2023 at 10:39 AM Butao Zhang <butaozha...@163.com> wrote:

Hi, Okumin

I have encountered this issue before, and 'validWriteIdList' is also an incompatible parameter. I have submitted a PR in the trino-hive-apache repo; you can refer to https://github.com/trinodb/trino-hive-apache/pull/43 .

IIUC, the 'engine' parameter is used to differentiate between stats produced by different engines (Hive, Spark, Presto, Impala), but it seems that the downstream engines do not want to adopt and implement the new 'engine' parameter. At present, if an engine (e.g. Trino) uses a customized Thrift API to interact with HMS, it must change its Thrift file to match the Thrift definition of HMS.

BTW, maybe we can change the HMS Thrift file to make the 'engine' parameter optional, so that other customized Thrift clients will not have compatibility issues.

Thanks,
Butao Zhang

---- Replied Message ----
| From | Okumin<m...@okumin.com> |
| Date | 8/10/2023 23:41 |
| To | <dev@hive.apache.org> |
| Subject | How to use `engine` introduced by HIVE-22046 |

Hi Hive developers,

I noticed HIVE-22046 introduced an incompatibility to Metastore APIs while I was testing integration between Hive 4 and other software. If I understand correctly, clients are currently required to additionally specify the engine name when they get or update column statistics.

- https://issues.apache.org/jira/browse/HIVE-22046
- https://github.com/apache/hive/pull/741

For example, Trino has a feature to use column stats and it fails. Note that I am not 100% sure about Trino's implementation or behavior.

```
trino> create table hive.default.test_trino (id int);
Query 20230810_152236_00004_t9n6h failed: Required field 'engine' is unset! Struct:TableStatsRequest(dbName:default, tblName:test_trino, colNames:[id], engine:null)
```

I have two questions about this feature.

(1) Should every engine use a unique engine name?

I guess some software can store or use stats compatible with Hive. I wonder if it can reuse engine=hive in that case, or should use a different name like engine=trino. I see Impala gives a unique engine name to the metastore. At a glance, Spark is unlikely to be using Hive's column stats directly.

- https://issues.apache.org/jira/browse/IMPALA-8842

(2) Should Hive Metastore use engine=hive as a default value?

If other compatible software can reuse engine=hive, it could be an option to accept requests in the old format, assuming the engine is "hive" for compatibility. Or should they explicitly specify engine=hive when using Hive 4?

Regards,
Okumin
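
To make question (2) concrete, here is a minimal sketch of the fallback idea, not the actual Metastore code. The dataclass is a simplified, hypothetical stand-in for the `TableStatsRequest` struct shown in the error message above, and `DEFAULT_ENGINE` and `resolve_engine` are names invented for illustration. It assumes the server would treat an unset engine as "hive" rather than rejecting the request.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: the fallback value proposed in question (2).
DEFAULT_ENGINE = "hive"


@dataclass
class TableStatsRequest:
    """Simplified, hypothetical stand-in for the Thrift struct in the error message."""
    db_name: str
    tbl_name: str
    col_names: List[str]
    # None models a client that still sends the old request format without an engine.
    engine: Optional[str] = None


def resolve_engine(request: TableStatsRequest) -> str:
    """Return the engine to look up stats for, falling back to the default when unset."""
    return request.engine if request.engine else DEFAULT_ENGINE


# A client on the old format (no engine) and a client that names its engine.
old_client = TableStatsRequest("default", "test_trino", ["id"])
impala_client = TableStatsRequest("default", "test_trino", ["id"], engine="impala")

print(resolve_engine(old_client))
print(resolve_engine(impala_client))
```

With such a fallback, old-format requests would read and write the same stats as Hive itself, which is only safe if the client's stats are actually compatible with Hive's; that is the open point in question (1).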