Re: IMPORTANT: Hive date parsing issue, input needed
Hi, I put supplementary information in the ticket, such as examples of Hive 2, Hive 3, Spark, Trino, or my research of the ANSI standard. Regards, Okumin On Wed, Sep 18, 2024 at 7:17 PM Attila Turoczy wrote: > > Hi Folks, > > Let me share my opinion about this change and request. To have an option for > parse data from string is a very valuable feature especially in OLAP, because > many of the input data many times come from different formats. It is easy and > makes sense that the date follows the ISO format. But real life is different. > Especially on data lakes the sources could come from many formats and the ETL > job parses excel, json, csv etc to create tables, insert rows. For those > countries like Hungary where the ISO date format is default everywhere the > HIVE-27586 would not cause any issue. However, in regions such as the US and > UK, where date formats like mm/dd/ and dd/mm/ are common (e.g., in > Excel files), this could present challenges. In these cases, users may > encounter problems. In my mind especially for these formats it has to be null > or as an oracle throws an error. If users fail to validate the output, this > could lead to data correctness issues or, worse, potential data loss if > source files are deleted. Yes, Hive users do not like the behavior change, > but one type (one of the most popular formats in the US) could lead to more > problems. > > -Attila > > On Wed, Sep 18, 2024 at 9:43 AM Ayush Saxena wrote: >> >> Since this is landing in Spam for some reason for most people, I am >> explicitly forwarding it rather than replying, so as people can get >> it. >> >> I will vote for (2) considering it was returning `null` earlier as >> well & the new behaviour came up recently, it would be a behaviour >> change for folks migrating from earlier versions of hive to 4.x & the >> new behaviour ain't that fancy either... >> >> -Ayush >> >> -- Forwarded message - >> From: Zoltán Rátkai >> Date: Wed, 18 Sept 2024 at 12:55 >> Subject: Re: IMPORTANT: Hive date parsing issue, input needed >> To: >> Cc: dev >> >> >> Hi Ayush, >> >> thanks for your thoughts! >> >> My answers: >> >> - On the official Hive page about types this is not documented at all: >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82706456#LanguageManualTypes-date >> >> "Dates: DATE values describe a particular year/month/day, in the form >> -MM-DD. For example, DATE '2013-01-01'. Date types do not have >> a time of day component. The range of values supported for the Date >> type is -01-01 to -12-31, dependent on support by the >> primitive Java Date type." >> >> - Since the beginning of Hive it worked like returning null, this was >> introduced one year ago as a side effect of this: HIVE-27586 "Parse >> dates from strings ignoring trailing (potentialy) invalid chars", so >> the goal of this ticket was to be able to handle string to date >> parsing like : "2024-09-18 GARBAGE" and parse it to 2024-09-18 as >> date. >> So it is not a long standing behaviour. >> >> I made a comparison of other DB (Postre, MySQL, ORACLE) and none of >> them handle it like Hive now. Please check it here: >> https://issues.apache.org/jira/browse/HIVE-28483 >> I agree with you to align with those. This is why I asked this question. >> >> Thanks and regards, >> >> Zoltan Ratkai >> >> >> On Wed, Sep 18, 2024 at 8:58 AM Ayush Saxena wrote: >> > >> > ++ Adding the dev mailing list, as this might be relevant to them as well. >> > >> > I have a few additional questions: >> > >> > * Has this behavior been present since the inception of Hive, or did >> > it start occurring more recently? If this has been the behavior for >> > over a decade, it might be best to leave it unchanged. However, if >> > this is a recent development, we should aim to restore the original >> > behavior. In my opinion, any change that alters long-standing behavior >> > could be considered incompatible unless it was introduced to prevent >> > data loss or address a security issue. >> > * How do other engines (e.g., Impala, Spark) handle similar scenarios, >> > and how do other databases like MySQL and PostgreSQL behave in this >> > regard? In the past, we've typically aimed to align with these >> > systems, not necessarily exactly, but closely enough—especially in >> > cases where there's ambigu
Bucket Map Join on Iceberg
Hi Hive developers, I started the Partition-Aware Optimization initiative[1], which will implement algorithms using bucketing or sorting definitions on top of Apache Iceberg or other table formats' partition specs. I created the first PR to unlock Bucket Map Join to Apache Iceberg-based tables. I'd be glad, especially if Iceberg experts could take a look. I appreciate that Denys already had a look. Thanks. https://github.com/apache/hive/pull/5409 - [1] https://issues.apache.org/jira/browse/HIVE-28410 Regards, Okumin
Re: Reserved keywords of Hive 4
Hi, I updated the wiki page. Please feel free to comment on HIVE-19741 if you find any issues. If we find no problems, I will close the ticket by the end of this week. Thanks for giving me advice and permission to update the WIKI. Regards, Okumin On Sat, Aug 17, 2024 at 1:28 AM Okumin wrote: > > Hi Denys, > > We have yet to get it, but we will list it unless anyone else in the > community already has it. We'd be delighted if you could help us > update the WIKI then! > > Regards, > Okumin > > On Fri, Aug 16, 2024 at 6:33 PM Denys Kuzmenko wrote: > > > > Hi Okumin, > > > > If you have a list, I can update the wiki or grant you the edit permissions.
Re: Reserved keywords of Hive 4
Hi Denys, We have yet to get it, but we will list it unless anyone else in the community already has it. We'd be delighted if you could help us update the WIKI then! Regards, Okumin On Fri, Aug 16, 2024 at 6:33 PM Denys Kuzmenko wrote: > > Hi Okumin, > > If you have a list, I can update the wiki or grant you the edit permissions.
Reserved keywords of Hive 4
Hi Hive developers, Could anyone please update Hive's reserved keywords in the WIKI if they have the latest list? If no one has the list, we are willing to help investigate new words. https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl We already have a related ticket. https://issues.apache.org/jira/browse/HIVE-19741 And some words seem to be unintentionally reserved. I'm working on it. https://issues.apache.org/jira/browse/HIVE-28446 Regards, Okumin
Re: CI Jenkins upgrade
Hi Stamatis, Thanks for the update. The existence of a reliable CI system is one of the biggest reasons why we prefer to always send our patches to the upstream and keep using as new a version as possible. Regards, Okumin On Thu, Jul 4, 2024 at 7:15 PM Stamatis Zampetakis wrote: > > The upgrade completed successfully and https://ci.hive.apache.org/ is > now running with Jenkins 2.452.2. Please let me know if you encounter > any issues. > > Best, > Stamatis > > On Thu, Jul 4, 2024 at 11:08 AM Stamatis Zampetakis wrote: > > > > The upgrade will start in a few minutes. To avoid erratic behavior > > please avoid doing Jenkins related operations during the upgrade > > interval. > > > > I will send another email once the upgrade is done. > > > > Best, > > Stamatis > > > > On Wed, Jul 3, 2024 at 2:31 PM Stamatis Zampetakis > > wrote: > > > > > > Hi all, > > > > > > I plan to proceed with the Jenkins upgrade tomorrow July 4, 2024 > > > starting at 09:00 UTC. > > > > > > The https://ci.hive.apache.org/ will be unavailable for at least a few > > > minutes and if all goes well, less than an hour. > > > > > > If there are any reasons to postpone the upgrade please let me know. > > > > > > Best, > > > Stamatis > > > > > > On Tue, Jun 25, 2024 at 10:31 AM Stamatis Zampetakis > > > wrote: > > > > > > > > Hey team, > > > > > > > > As part of HIVE-28339, we are planning to upgrade the Jenkins version > > > > that is used to run our pre-commit tests. > > > > > > > > The Jenkins instance is running on a Kubernetes cluster deployed on > > > > Google Cloud. > > > > > > > > I am willing to push HIVE-28339 forward but I don't have any > > > > experience in Jenkins upgrades. If someone else has done this in the > > > > past and has any insights about the process feel free to comment under > > > > the ticket otherwise I will hone my devops skills :) > > > > > > > > Probably there will be some downtime during the upgrade but we will > > > > share more about this aspect once we have a more concrete plan. > > > > > > > > Best, > > > > Stamatis
Re: Re: Next Hive 4.0.1 minor release
Hi, I'm posting some more suggestions. # We may label HIVE-28352 as "hive-4.0.1-must" I found that users were unable to upgrade some system tables from Hive 3 through schematool. It would be better to fix. https://issues.apache.org/jira/browse/HIVE-28352 # We may drop HIVE-24167 I am the ticket owner, and we are struggling to figure out how to resolve the problem. We can potentially skip it as branch-4.0 includes the following workaround. https://issues.apache.org/jira/browse/HIVE-27856 Thanks, Okumin On Wed, May 15, 2024 at 5:18 PM dengzhhu653 wrote: > > > Thank you for your feedback and testing, Okumin! I've tagged them with > "hive-4.0.1-must". > > > Thanks, > > Zhihua > > At 2024-05-15 10:43:59, "Okumin" wrote: > >Hi Zhihua, > > > >Thanks for driving the next release. We are actively testing 4.0.0 and > >would like to give some suggestions. > > > ># HIVE-27847: Prevent query Failures on Numeric <-> Timestamp > >We hit the issue when we ran Hive 4 with the option. I believe it is > >worth resolving for those who want to try Hive 4, keeping > >compatibilities with a previous version. > >https://issues.apache.org/jira/browse/HIVE-27847 > > > ># HIVE-28098: Fails to copy empty column statistics of materialized CTE > >This follows up on HIVE-28080, but the current 4.0.0 includes only > >HIVE-28080. The reasonable option to me is to revert HIVE-28080 or > >cherry-pick HIVE-28098, all or nothing. > >https://issues.apache.org/jira/browse/HIVE-28098 > > > >Thanks, > >Okumin > > > >On Sat, May 11, 2024 at 9:45 AM dengzhhu653 wrote: > >> > >> Hello Community, > >> > >> > >> As you have noticed, we are going to propose the next 4.0.1 release on top > >> of 4.0.0, with some > >> > >> critical bug fixes and improvements [1]. As of now we are putting the > >> label "hive-4.0.1-must" on the tickets > >> > >> and we plan to make sure those get c-picked to branch-4.0 [2]. Please > >> suggest other important fixes that can be > >> > >> included in this release if any. > >> > >> > >> We will get this minor release out as soon as possible once all the > >> tickets marked with "hive-4.0.1-must" get resolved and tested. > >> > >> > >> [1] https://lists.apache.org/thread/rkw2toj5d74t8n5jvnkrfw77hyzn7qh3 > >> > >> [2] > >> https://issues.apache.org/jira/browse/HIVE-28204?jql=labels%20%3D%20hive-4.0.1-must > >> > >> > >> Thanks, > >> > >> Zhihua
Re: Next Hive 4.0.1 minor release
Hi Zhihua, Thanks for driving the next release. We are actively testing 4.0.0 and would like to give some suggestions. # HIVE-27847: Prevent query Failures on Numeric <-> Timestamp We hit the issue when we ran Hive 4 with the option. I believe it is worth resolving for those who want to try Hive 4, keeping compatibilities with a previous version. https://issues.apache.org/jira/browse/HIVE-27847 # HIVE-28098: Fails to copy empty column statistics of materialized CTE This follows up on HIVE-28080, but the current 4.0.0 includes only HIVE-28080. The reasonable option to me is to revert HIVE-28080 or cherry-pick HIVE-28098, all or nothing. https://issues.apache.org/jira/browse/HIVE-28098 Thanks, Okumin On Sat, May 11, 2024 at 9:45 AM dengzhhu653 wrote: > > Hello Community, > > > As you have noticed, we are going to propose the next 4.0.1 release on top of > 4.0.0, with some > > critical bug fixes and improvements [1]. As of now we are putting the label > "hive-4.0.1-must" on the tickets > > and we plan to make sure those get c-picked to branch-4.0 [2]. Please suggest > other important fixes that can be > > included in this release if any. > > > We will get this minor release out as soon as possible once all the tickets > marked with "hive-4.0.1-must" get resolved and tested. > > > [1] https://lists.apache.org/thread/rkw2toj5d74t8n5jvnkrfw77hyzn7qh3 > > [2] > https://issues.apache.org/jira/browse/HIVE-28204?jql=labels%20%3D%20hive-4.0.1-must > > > Thanks, > > Zhihua
Re: [VOTE] Mark Hive 2.x EOL
+1 (non-binding) I appreciate the community's efforts in maintaining 2.x for so long. Thanks, Okumin On Sat, May 11, 2024 at 1:57 AM Abhishek Gupta wrote: > > Unsubscribe > > On Fri, 10 May 2024 at 10:26 PM, Aman Sinha wrote: >> >> +1 (non-binding) >> >> On Fri, May 10, 2024 at 7:57 AM Mahesh Raju Somalaraju >> wrote: >>> >>> +1(non-binding) >>> >>> Thanks >>> Mahesh Raju S >>> >>> On Fri, 10 May 2024, 06:15 Ayush Saxena, wrote: >>>> >>>> Hi All, >>>> Following the discussion at [1]. Starting the official vote thread to >>>> mark Hive 2.x release line as EOL. >>>> >>>> Marking a release lines as EOL means there won't be any further >>>> release made for that release line >>>> >>>> I will start with my +1 >>>> >>>> -Ayush >>>> >>>> >>>> [1] https://lists.apache.org/thread/91wk3oy1qo953md7941ojg2q97ofsl2d
Re: [ANNOUNCE] Apache Hive 4.0.0 Released
I'm really excited to see the news! I can easily imagine the difficulty of testing and shipping Hive 4.0.0 with more than 5k commits. I'm proud to have witnessed this moment here. Thank you! On Wed, Apr 3, 2024 at 3:07 AM Naveen Gangam wrote: > > Thank you for the tremendous amount of work put in by many many folks to make > this release happen, including projects hive is dependent upon like tez. > > Thank you to all the PMC members, committers and contributors for all the > work over the past 5+ years in shaping this release. > > THANK YOU!!! > > On Sun, Mar 31, 2024 at 8:54 AM Battula, Brahma Reddy > wrote: >> >> Thank you for your hard work and dedication in releasing Apache Hive version >> 4.0.0. >> >> >> >> Congratulations to the entire team on this achievement. Keep up the great >> work! >> >> >> >> Does this consider as GA.? >> >> >> >> And Looks we need to update in the following location also.? >> >> https://hive.apache.org/general/downloads/ >> >> >> >> >> >> From: Denys Kuzmenko >> Date: Saturday, March 30, 2024 at 00:07 >> To: u...@hive.apache.org , dev@hive.apache.org >> >> Subject: [ANNOUNCE] Apache Hive 4.0.0 Released >> >> The Apache Hive team is proud to announce the release of Apache Hive >> >> version 4.0.0. >> >> >> >> The Apache Hive (TM) data warehouse software facilitates querying and >> >> managing large datasets residing in distributed storage. Built on top >> >> of Apache Hadoop (TM), it provides, among others: >> >> >> >> * Tools to enable easy data extract/transform/load (ETL) >> >> >> >> * A mechanism to impose structure on a variety of data formats >> >> >> >> * Access to files stored either directly in Apache HDFS (TM) or in other >> >> data storage systems such as Apache HBase (TM) >> >> >> >> * Query execution via Apache Hadoop MapReduce, Apache Tez and Apache Spark >> frameworks. (MapReduce is deprecated, and Spark has been removed so the text >> needs to be modified depending on the release version) >> >> >> >> For Hive release details and downloads, please visit: >> >> https://hive.apache.org/downloads.html >> >> >> >> Hive 4.0.0 Release Notes are available here: >> >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343343&styleName=Text&projectId=12310843 >> >> >> >> We would like to thank the many contributors who made this release >> >> possible. >> >> >> >> Regards, >> >> >> >> The Apache Hive Team
Re: Release of Hive 4 and TPC-DS benchmark
Hi, We deployed a test version of Hive 4 in our production. It is mostly working well. I appreciate the community's effort. Let me bump this thread so that community members can share the latest situation, and we can contribute to the remaining problems. Currently, we list the following tickets labeled with hive-4.0.0-must[1]. - HIVE-27858: OOM happens when selecting many columns and JOIN - HIVE-26654: Test with the TPC-DS benchmark - HIVE-24167: No.14 (Postponed until 4.1.0) - HIVE-26986: No.71 (No labels) We have already applied a workaround[2] for HIVE-24167. Also, Denys says HIVE-26986 is not a blocker of Hive 4.0.0. So, I expect HIVE-27858 to be the only blocker. I have some proposals about the next action. - Can we take over HIVE-27858? We are the reporter of the ticket. It would be reasonable If Naveen is not working on it yet - We may review HIVE-27984[3]. As I mentioned in another thread[4], it would have a fairly big impact on the Hive ecosystem And I would like someone to take a look at the following PR for HIVE-24167. It is not a blocker thanks to `hive.optimize.cte.materialize.threshold=-1`. However, we plan to use Hive 4 with CTE materialization. https://github.com/apache/hive/pull/5037 Thanks! - [1] https://issues.apache.org/jira/issues/?jql=labels%20%3D%20hive-4.0.0-must - [2] https://github.com/apache/hive/pull/4858 - [3] https://github.com/apache/hive/pull/4984 - [4] https://www.mail-archive.com/dev@hive.apache.org/msg144280.html Regards, Okumin On Thu, Nov 23, 2023 at 11:42 PM Butao Zhang wrote: > > Cool! Thank you Denys for driving this! > > > Thanks, > Butao Zhang > Replied Message > | From | Attila Turoczy | > | Date | 11/23/2023 22:29 | > | To | | > | Subject | Re: Release of Hive 4 and TPC-DS benchmark | > Excellent new Denys! Hive 4 is here! Can't wait :) > > -Attila > > > On Thu, Nov 23, 2023 at 3:20 PM Denys Kuzmenko wrote: > > Update: > 1. Query 2, 71: Resolved in HIVE-27006 [1]; > > 2. Query 97: Under review in HIVE-27269 [2]; > Thanks, Seonggon for providing a reproduce qfile. > > 3. Query 14: Reported in HIVE-24167 [3]; > set hive.optimize.cte.materialize.threshold to -1 by default in Hive 4 and > fix it in the next versions > > 4. HIVE-26986 [4] is a performance improvement that is nice to have, but > not a blocker for the release. > > Be advised, next week we plan to cut 4.0.0 release branch from master and > start testing. > > Thanks, Denys > > [1] https://issues.apache.org/jira/browse/HIVE-27006 > [2] https://issues.apache.org/jira/browse/HIVE-27269 > [3] https://issues.apache.org/jira/browse/HIVE-24167 > [4] https://issues.apache.org/jira/browse/HIVE-26986 > >
Re: How to use `engine` introduced by HIVE-22046
Hi Butao, Thanks for reminding us about the issue. +1 for the approach. Thanks, Okumin On Mon, Jan 8, 2024 at 1:38 PM Butao Zhang wrote: > > Hi, dev > Bump this thread! > I just filed a ticket to track&fix this incompatibility issues about hms > column stats thrift api. I think we can fix this at its root in HMS side and > then any other third components will not suffer from this issue. > https://issues.apache.org/jira/browse/HIVE-27984 > > > Thanks, > Butao Zhang > ---- Replied Message > | From | Okumin | > | Date | 8/20/2023 18:37 | > | To | | > | Subject | Re: How to use `engine` introduced by HIVE-22046 | > Hi Butao, > > Thanks for sharing your PR! I didn't find trinodb/trino-hive-apache > or trinodb/hive-thrift. > > As mentioned in the PR, the current Thrift definitions might not be the > final version, but it sounds reasonable to give information to external > products since we versioned Hive 4 beta. I'm curious if anyone why we give > different engine names to Hive and Impala and what are the recommended > options. > > Thanks, > Okumin > > > > On Fri, Aug 11, 2023 at 10:39 AM Butao Zhang wrote: > > Hi, Okumin > > > I have encountered this issue before, and the 'validWriteIdList' is also a > incompatibility parameter. I have submit a PR in trino-hive-apache repo, > and you can refer to https://github.com/trinodb/trino-hive-apache/pull/43 > . > IIUC, the 'engine' parameter is used to differentiate between stats > produced by different engines(Hive&Spark&Presto&Impala), but it seems that > the downstream engines do not want to adopt&realize the new 'engine' > parameter. > At present, if some engines(e.g. Trino) use the customized thrift api to > interact wiht hms, it must change its thrift file to match the thrift > definition of hms. > BTW, maybe we can change hms thrift file to make the 'engine' parameter > optional and then other customized thrift client will not have > compatibility issues. > > Thanks, > > Butao Zhang > > Replied Message > | From | Okumin | > | Date | 8/10/2023 23:41 | > | To | | > | Subject | How to use `engine` introduced by HIVE-22046 | > Hi Hive developers, > > I noticed HIVE-22046 introduced incompatibility to Metastore APIs while I'm > testing integration between Hive 4 and other software. If I understand > correctly, clients are currently required to additionally specify the > engine name when they get or update column statistics. > > - https://issues.apache.org/jira/browse/HIVE-22046 > - https://github.com/apache/hive/pull/741 > > For example, Trino has a feature to use column stats and it fails. Note > that I am not 100% sure about Trino's implementation or behavior. > > ``` > trino> create table hive.default.test_trino (id int); > Query 20230810_152236_4_t9n6h failed: Required field 'engine' is unset! > Struct:TableStatsRequest(dbName:default, tblName:test_trino, colNames:[id], > engine:null) > ``` > > I have two questions about this feature. > > (1) Should any engine use a unique engine name? > > I guess some software can store or use stats compatible with Hive. I wonder > if it can reuse engine=hive in that case, or should use a different name > like engine=trino. > > I see Impala gives a unique engine name to metastore. Taking a glance, > Spark is unlikely to be using col stats of Hive directly. > > - https://issues.apache.org/jira/browse/IMPALA-8842 > > (2) Should Hive Metastore use engine=hive as a default value? > > If other compatible software can reuse engine=hive, it could be an option > to accept requests with the old format assuming its engine is "hive" for > compatibility. Or should they explicitly specify engine=hive when using > Hive 4? > > Regards, > Okumin >
Discussion about HIVE-12679 to make IMetaStoreClient pluggable
Hi, I'm working on introducing a feature to make IMetaStoreClient pluggable. I'm sending this e-mail to gather opinions in a visible manner because it has controversial points. Some Hive users need the feature in order to integrate Hive with a data catalog other than HMS. Although the original patch was submitted more than 7 years ago and many users have wanted it, it has not been merged yet. I revived the ticket and PR so that we can maintain or improve it within the community. - JIRA: https://issues.apache.org/jira/browse/HIVE-12679 - PR: https://github.com/apache/hive/pull/ I initially created the above PR based on the original design. That's because I think it is reasonable enough and I can see some users have already ported the patch for the past 7 years. But there are also other opinions to suggest other designs. This is a summary for easy catch-up. https://gist.github.com/okumin/30b058b14db1b099ba37ba7dc257fe8e If you are interested in this problem and you have any opinions, please put a comment on the Pull Request. Regards, Okumin
Re: How to use `engine` introduced by HIVE-22046
Hi Butao, Thanks for sharing your PR! I didn't find trinodb/trino-hive-apache or trinodb/hive-thrift. As mentioned in the PR, the current Thrift definitions might not be the final version, but it sounds reasonable to give information to external products since we versioned Hive 4 beta. I'm curious if anyone why we give different engine names to Hive and Impala and what are the recommended options. Thanks, Okumin On Fri, Aug 11, 2023 at 10:39 AM Butao Zhang wrote: > Hi, Okumin > > > I have encountered this issue before, and the 'validWriteIdList' is also a > incompatibility parameter. I have submit a PR in trino-hive-apache repo, > and you can refer to https://github.com/trinodb/trino-hive-apache/pull/43 > . > IIUC, the 'engine' parameter is used to differentiate between stats > produced by different engines(Hive&Spark&Presto&Impala), but it seems that > the downstream engines do not want to adopt&realize the new 'engine' > parameter. > At present, if some engines(e.g. Trino) use the customized thrift api to > interact wiht hms, it must change its thrift file to match the thrift > definition of hms. > BTW, maybe we can change hms thrift file to make the 'engine' parameter > optional and then other customized thrift client will not have > compatibility issues. > > Thanks, > > Butao Zhang > > Replied Message > | From | Okumin | > | Date | 8/10/2023 23:41 | > | To | | > | Subject | How to use `engine` introduced by HIVE-22046 | > Hi Hive developers, > > I noticed HIVE-22046 introduced incompatibility to Metastore APIs while I'm > testing integration between Hive 4 and other software. If I understand > correctly, clients are currently required to additionally specify the > engine name when they get or update column statistics. > > - https://issues.apache.org/jira/browse/HIVE-22046 > - https://github.com/apache/hive/pull/741 > > For example, Trino has a feature to use column stats and it fails. Note > that I am not 100% sure about Trino's implementation or behavior. > > ``` > trino> create table hive.default.test_trino (id int); > Query 20230810_152236_4_t9n6h failed: Required field 'engine' is unset! > Struct:TableStatsRequest(dbName:default, tblName:test_trino, colNames:[id], > engine:null) > ``` > > I have two questions about this feature. > > (1) Should any engine use a unique engine name? > > I guess some software can store or use stats compatible with Hive. I wonder > if it can reuse engine=hive in that case, or should use a different name > like engine=trino. > > I see Impala gives a unique engine name to metastore. Taking a glance, > Spark is unlikely to be using col stats of Hive directly. > > - https://issues.apache.org/jira/browse/IMPALA-8842 > > (2) Should Hive Metastore use engine=hive as a default value? > > If other compatible software can reuse engine=hive, it could be an option > to accept requests with the old format assuming its engine is "hive" for > compatibility. Or should they explicitly specify engine=hive when using > Hive 4? > > Regards, > Okumin >
How to use `engine` introduced by HIVE-22046
Hi Hive developers, I noticed HIVE-22046 introduced incompatibility to Metastore APIs while I'm testing integration between Hive 4 and other software. If I understand correctly, clients are currently required to additionally specify the engine name when they get or update column statistics. - https://issues.apache.org/jira/browse/HIVE-22046 - https://github.com/apache/hive/pull/741 For example, Trino has a feature to use column stats and it fails. Note that I am not 100% sure about Trino's implementation or behavior. ``` trino> create table hive.default.test_trino (id int); Query 20230810_152236_4_t9n6h failed: Required field 'engine' is unset! Struct:TableStatsRequest(dbName:default, tblName:test_trino, colNames:[id], engine:null) ``` I have two questions about this feature. (1) Should any engine use a unique engine name? I guess some software can store or use stats compatible with Hive. I wonder if it can reuse engine=hive in that case, or should use a different name like engine=trino. I see Impala gives a unique engine name to metastore. Taking a glance, Spark is unlikely to be using col stats of Hive directly. - https://issues.apache.org/jira/browse/IMPALA-8842 (2) Should Hive Metastore use engine=hive as a default value? If other compatible software can reuse engine=hive, it could be an option to accept requests with the old format assuming its engine is "hive" for compatibility. Or should they explicitly specify engine=hive when using Hive 4? Regards, Okumin
Re: Possibility of supporting TIME or other standard types
Hi Stamatis, Thanks for your reply and +1. Anyway, I also guess there is no semantic reason not to support the type. I created a ticket and may try to work on it when I have a chance. https://issues.apache.org/jira/browse/HIVE-27587 Please let me know if anyone else knows any difficulties or historical reasons why we are not supporting it. Regards, Okumin On Wed, Aug 9, 2023 at 5:50 PM Stamatis Zampetakis wrote: > Hello Okumin, > > As you mentioned the TIME datatype is part of the SQL standard and it > is also supported by many popular DBMS so it definitely makes sense to > add it to Hive. > I guess it was not implemented already cause users were able to store > times using other existing types so it never became a must have. > > +1 on adding TIME data type support in Hive. > > Best, > Stamatis > > On Fri, Aug 4, 2023 at 12:19 PM Okumin wrote: > > > > Hi everyone, > > > > I happened to find some people struggling to store values of the TIME > type > > in a Hive table from another query engine. A viable option is to use > > Iceberg or another format instead of Hive native tables since Hive > doesn't > > directly support the type. I agree that it could be one of the right > > options in this era. > > > > On the other hand, I also think it is a valid request to support TIME > since > > it is one of the types defined in the SQL standard. I expect it is not a > > bad offer also for Hive users if they can process Iceberg's TIME as > Hive's > > TIME, not STRING or other alternatives. I'd like to hear if we have any > > reasons not to support it easily. > > > > I see there is a related ticket but looks like we have not developed TIME > > yet. > > https://issues.apache.org/jira/browse/HIVE-1269 > > > > Regards, > > Okumin >
Possibility of supporting TIME or other standard types
Hi everyone, I happened to find some people struggling to store values of the TIME type in a Hive table from another query engine. A viable option is to use Iceberg or another format instead of Hive native tables since Hive doesn't directly support the type. I agree that it could be one of the right options in this era. On the other hand, I also think it is a valid request to support TIME since it is one of the types defined in the SQL standard. I expect it is not a bad offer also for Hive users if they can process Iceberg's TIME as Hive's TIME, not STRING or other alternatives. I'd like to hear if we have any reasons not to support it easily. I see there is a related ticket but looks like we have not developed TIME yet. https://issues.apache.org/jira/browse/HIVE-1269 Regards, Okumin
Re: Introduce Uniffle : A stability solution of Hive's shuffle
Hi Rory, Let me express my gratitude and positive impression of Uniffle. Actually, we also feel the necessity of a shuffle service for our Hive deployment, and I've been watching the project. I will check the implementation for Tez and send feedback or PRs if I find something. Regards, Okumin On Tue, Jul 11, 2023 at 9:48 PM roryqi wrote: > Dear Apache Hive community, > > > We are delighted to announce the support of Tez on Uniffle. Uniffle havs > supported Apache Spark, Apache,Hadoop MapReduce and Apache Tez. > > Uniffle is a remote shuffle service. In several situations, Uniffle will > provide great help. > >1. If you use AWS spot instances or mix resources, tasks may be >preempted. It will be great if we store shuffle data in the Uniffle and > we >can deploy Uniffle on some stable resource. It will improve the > stability >of tasks. If tasks are preempted, we won’t recompute tasks if we store >shuffle in the Uniffle. >2. For large shuffle jobs, Uniffle can reduce random IO for the jobs. >Uniffle can improve the performance of jobs. For 1TB MapReduce > Terasort, 1w >map tasks, 1w reduce tasks, job performance will increase 30%. > > We also welcome pull requests and are eager to see how you might use > Uniffle to make Hive more user-friendly. More information, you can access > https://github.com/apache/incubator-uniffle > > > Best > > Rory >
[jira] [Created] (HIVE-27120) Warn when Authorizer V2 is configured
okumin created HIVE-27120: - Summary: Warn when Authorizer V2 is configured Key: HIVE-27120 URL: https://issues.apache.org/jira/browse/HIVE-27120 Project: Hive Issue Type: Improvement Reporter: okumin SessionState configures an internal param that is not listed in HiveConf. It causes WARN. {code:java} pod/hive-hiveserver2-7fc4df88b6-dmn8v: 2023-03-01T13:50:38,959 WARN [main] conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
okumin created HIVE-26184: - Summary: COLLECT_SET with GROUP BY is very slow when some keys are highly skewed Key: HIVE-26184 URL: https://issues.apache.org/jira/browse/HIVE-26184 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 3.1.3, 2.3.8 Reporter: okumin Assignee: okumin I observed some reducers spend 98% of CPU time in invoking `java.util.HashMap#clear`. Looking the detail, I found COLLECT_SET reuses a LinkedHashSet and its `clear` can be quite heavy when a relation has a small number of highly skewed keys. To reproduce the issue, first, we will create rows with a skewed key. {code:java} INSERT INTO test_collect_set SELECT '----' AS key, CAST(UUID() AS VARCHAR) AS value FROM table_with_many_rows LIMIT 10;{code} Then, we will create many non-skewed rows. {code:java} INSERT INTO test_collect_set SELECT UUID() AS key, UUID() AS value FROM sample_datasets.nasdaq LIMIT 500;{code} We can observe the issue when we aggregate values by `key`. {code:java} SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HIVE-25410) CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size
okumin created HIVE-25410: - Summary: CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size Key: HIVE-25410 URL: https://issues.apache.org/jira/browse/HIVE-25410 Project: Hive Issue Type: Bug Components: Hive Reporter: okumin Assignee: okumin Fix For: 4.0.0 Thanks to HIVE-24883, CommonMergeJoinOperator can handle ARRAY or STRUCT types as a JOIN key. There are corner cases where CommonMergeJoinOperator fails with `ArrayIndexOutOfBoundsException`. This is a simple case. {code:java} SET hive.auto.convert.join=false; CREATE TABLE table_list_types (id int, key array); INSERT INTO table_list_types VALUES (1, array(1, 2)), (2, array(1, 2)), (3, array(1, 2, 3)), (4, array(1, 2, 3)); SELECT * FROM table_list_types t1 INNER JOIN table_list_types t2 ON t1.key = t2.key; {code} With 69c97c26ac68a245f4d327cc2f7b3a2333f8fa84, the following error happened. {code:java} Caused by: java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.hadoop.hive.ql.exec.HiveStructComparator.compare(HiveStructComparator.java:57) at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.compareKey(CommonMergeJoinOperator.java:629) at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.compareKeys(CommonMergeJoinOperator.java:597) at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.processKey(CommonMergeJoinOperator.java:566) at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:249) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370) ... 26 more {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-24606) Multi-stage materialized CTEs can lost intermediate data
okumin created HIVE-24606: - Summary: Multi-stage materialized CTEs can lost intermediate data Key: HIVE-24606 URL: https://issues.apache.org/jira/browse/HIVE-24606 Project: Hive Issue Type: Bug Components: Query Planning Affects Versions: 3.1.2, 2.3.7, 4.0.0 Reporter: okumin Assignee: okumin With complex multi-stage CTEs, Hive can start a latter stage before its previous stage finishes. That's because `SemanticAnalyzer#toRealRootTasks` can fail to resolve dependency between multistage materialized CTEs when a non-materialized CTE cuts in. [https://github.com/apache/hive/blob/425e1ff7c054f87c4db87e77d004282d529599ae/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L1414] For example, when submitting this query, {code:sql} SET hive.optimize.cte.materialize.threshold=2; SET hive.optimize.cte.materialize.full.aggregate.only=false; WITH x AS ( SELECT 'x' AS id ), -- not materialized a1 AS ( SELECT 'a1' AS id ), -- materialized by a2 and the root a2 AS ( SELECT 'a2 <- ' || id AS id FROM a1) -- materialized by the root SELECT * FROM a1 UNION ALL SELECT * FROM x UNION ALL SELECT * FROM a2 UNION ALL SELECT * FROM a2; {code} `toRealRootTask` will traverse the CTEs in order of `a1`, `x`, and `a2`. It means the dependency between `a1` and `a2` will be ignored and `a2` can start without waiting for `a1`. As a result, the above query returns the following result. {code:java} +-+ | id | +-+ | a1 | | x | +-+ {code} For your information, I ran this test with revision = 425e1ff7c054f87c4db87e77d004282d529599ae. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-24485) Make the slow-start behavior tunable
okumin created HIVE-24485: - Summary: Make the slow-start behavior tunable Key: HIVE-24485 URL: https://issues.apache.org/jira/browse/HIVE-24485 Project: Hive Issue Type: Improvement Components: Hive, Tez Affects Versions: 3.1.2, 4.0.0 Reporter: okumin Assignee: okumin This ticket would enable users to configure the timing of slow-start with `tez.shuffle-vertex-manager.min-src-fraction` and `tez.shuffle-vertex-manager.max-src-fraction`. Hive on Tez currently doesn't honor these parameters and ShuffleVertexManager always uses the default value. We can control the timing to start vertexes the accuracy of estimated input size if we can tweak these ones. This is useful when a vertex has tasks that process a different amount of data. We can reproduce the issue with this query. {code:java} SET hive.tez.auto.reducer.parallelism=true; SET hive.tez.min.partition.factor=1.0; -- enforce auto-parallelism SET tez.shuffle-vertex-manager.min-src-fraction=0.55; SET tez.shuffle-vertex-manager.max-src-fraction=0.95; CREATE TABLE mofu (name string); INSERT INTO mofu (name) VALUES ('12345'); SELECT name, count(*) FROM mofu GROUP BY name;{code} The fractions are ignored. {code:java} 2020-12-04 11:41:42,484 [INFO] [Dispatcher thread {Central}] |vertexmanager.ShuffleVertexManagerBase|: Settings minFrac: 0.25 maxFrac: 0.75 auto: true desiredTaskIput: 25600 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-24286) Render date and time with progress of Hive on Tez
okumin created HIVE-24286: - Summary: Render date and time with progress of Hive on Tez Key: HIVE-24286 URL: https://issues.apache.org/jira/browse/HIVE-24286 Project: Hive Issue Type: Improvement Affects Versions: 4.0.0 Reporter: okumin Assignee: okumin Add date/time to each line written by RenderStrategy like MapReduce and Spark. * [https://github.com/apache/hive/blob/31c1658d9884eb4f31b06eaa718dfef8b1d92d22/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/HadoopJobExecHelper.java#L350] * [https://github.com/apache/hive/blob/31c1658d9884eb4f31b06eaa718dfef8b1d92d22/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/RenderStrategy.java#L64-L67] This ticket would add the current time to the head of each line. {code:java} 2020-10-19 13:32:41,162 Map 1: 0/1 Reducer 2: 0/1 2020-10-19 13:32:44,231 Map 1: 0/1 Reducer 2: 0/1 2020-10-19 13:32:46,813 Map 1: 0(+1)/1 Reducer 2: 0/1 2020-10-19 13:32:49,878 Map 1: 0(+1)/1 Reducer 2: 0/1 2020-10-19 13:32:51,416 Map 1: 1/1 Reducer 2: 0/1 2020-10-19 13:32:51,936 Map 1: 1/1 Reducer 2: 0(+1)/1 2020-10-19 13:32:52,877 Map 1: 1/1 Reducer 2: 1/1 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-24240) Implement missing features in UDTFStatsRule
okumin created HIVE-24240: - Summary: Implement missing features in UDTFStatsRule Key: HIVE-24240 URL: https://issues.apache.org/jira/browse/HIVE-24240 Project: Hive Issue Type: Improvement Affects Versions: 4.0.0 Reporter: okumin Assignee: okumin Add the following steps. * Handle the case in which the num row will be zero * Compute runtime stats in case of a re-execution -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator
okumin created HIVE-24203: - Summary: Implement stats annotation rule for the LateralViewJoinOperator Key: HIVE-24203 URL: https://issues.apache.org/jira/browse/HIVE-24203 Project: Hive Issue Type: Improvement Components: Physical Optimizer Affects Versions: 2.3.7, 3.1.2, 4.0.0 Reporter: okumin Assignee: okumin StatsRulesProcFactory doesn't have any rules to handle a JOIN by LATERAL VIEW. This can cause an underestimation in case that UDTF in LATERAL VIEW generates multiple rows. HIVE-20262 has already added the rule for UDTF. This issue would add the rule for LateralViewJoinOperator. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-23831) Enforce auto-parallelism of Tez even if the number of reducers can be 1
okumin created HIVE-23831: - Summary: Enforce auto-parallelism of Tez even if the number of reducers can be 1 Key: HIVE-23831 URL: https://issues.apache.org/jira/browse/HIVE-23831 Project: Hive Issue Type: Improvement Components: Hive Reporter: okumin Assignee: okumin Hive on Tez disables auto-reduce and uses the maximum number of partitions when the minimum number of reducers, `estimated number * min.partition.factor`, is less than or equal to 1. [https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java#L121-L131] https://issues.apache.org/jira/browse/HIVE-14200 This can make Tez lose chance to optimize # of reducers especially in an environment with small `hive.tez.min.partition.factor`. I'd say it would be fine to introduce an option to enforce auto-reduce. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22618) Fix checkstyle violations for ParseUtils
okumin created HIVE-22618: - Summary: Fix checkstyle violations for ParseUtils Key: HIVE-22618 URL: https://issues.apache.org/jira/browse/HIVE-22618 Project: Hive Issue Type: Task Components: Query Processor Reporter: okumin Assignee: okumin ParseUtils.java has incorrect indents and it causes any changes to violate checkstyle. [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseUtils.java#L274] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-22601) Some columns will be lost when a UDTF has multiple aliases in some cases
okumin created HIVE-22601: - Summary: Some columns will be lost when a UDTF has multiple aliases in some cases Key: HIVE-22601 URL: https://issues.apache.org/jira/browse/HIVE-22601 Project: Hive Issue Type: Bug Components: Query Processor Affects Versions: 3.1.2 Reporter: okumin Assignee: okumin Only one column will be retained when putting UDTFs with multiple aliases and a top-level UNION together. For example, the result of the following SQL should have three columns, c1, c2 and c3. {code:java} SELECT stack(1, 'a', 'b', 'c') AS (c1, c2, c3) UNION ALL SELECT stack(1, 'd', 'e', 'f') AS (c1, c2, c3); {code} However, It's only the c3 column which I can get. {code:java} +-+ | _u1.c3 | +-+ | c | | f | +-+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)