Re: Start releasing the master branch
+1 that would be awesome to see Hive master released after so long. Either 4.0 or 4.0.0-alpha-1 makes sense to me, not sure how we would pick any 3.x or calendar date (which could tend to slip and be more confusing?).

Thanks in any case for getting the ball rolling.
Szehon

On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich wrote:

> Hey,
>
> Thank you guys for chiming in; versioning is for sure something we should get to some common ground.
> It's a triple problem right now; I think we have the following things:
> * storage-api
> ** we have "2.7.3-SNAPSHOT" in the repo
> *** https://github.com/apache/hive/blob/0d1cc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
> ** meanwhile we already have 2.8.1 released to maven central
> *** https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
> * standalone-metastore
> ** 4.0.0-SNAPSHOT in the repo
> ** last release is 3.1.2
> * hive
> ** 4.0.0-SNAPSHOT in the repo
> ** last release is 3.1.2
>
> Regarding the actual version number I'm not entirely sure where we should start the numbering - that's why I was referring to it as Hive-X in my first letter.
>
> I think the key point here would be to start shipping releases regularly and not the actual version number we will use - I'm kinda open to any versioning scheme which reflects that this is a newer release than 3.1.2.
>
> I could imagine the following ones:
> (A) start with something less expected; but keep 3 in the prefix to reflect that this is not yet 4.0
>     I can imagine the following numbers:
>     3.900.0, 3.901.0, ...
>     3.9.0, 3.9.1, ...
> (B) start 4.0.0
>     4.0.0, 4.1.0, ...
> (C) jump to some calendar based version number like 2022.2.9
>     trunk based development has pros and cons...making a move like this irreversibly pledges trunk based development; and makes release branches hard to introduce
> (X) somewhat orthogonal is to (also) use some suffixes
>     4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
>     this is probably the most tempting to use - but this versioning scheme with a non-changing MINOR and PATCH number will also suggest that the actual software is fully compatible - and only bugs are being fixed - which will not be true...
>
> I really like the idea to suffix these releases with alpha or beta - which will communicate our level of commitment: that these are not 100% production ready artifacts.
>
> I think we could fix HIVE-25665; and probably experiment with 4.0.0-alpha1 for start...
>
> > This also means there should *not* be a branch-4 after releasing Hive 4.0
> > and let that diverge (and becomes the next, super-ignored branch-3),
> correct; no need to keep a branch we don't maintain...but in any case I think we can postpone this decision until there will be something to release... :)
>
> cheers,
> Zoltan
>
> On 2/9/22 10:23 AM, László Bodor wrote:
> > Hi All!
> >
> > A purely technical question: what will the SNAPSHOT version become after releasing Hive 4.0.0? I think this is important, as it defines and reflects the future release plans.
> >
> > Currently, it's 4.0.0-SNAPSHOT, I guess it has been since Hive 3.0 + branch-3.
> > Hive is an evolving and super-active project: if we want to make regular releases, we should simply release Hive 4.0 and bump pom to 4.1.0-SNAPSHOT, which clearly says that we can release Hive 4.1 anytime we want, without being frustrated about "whether we included enough cool stuff to release 5.0".
> >
> > This also means there should *not* be a branch-4 after releasing Hive 4.0 and let that diverge (and becomes the next, super-ignored branch-3), only when we end up bringing a minor backward-incompatible thing that needs a 4.0.x, and when it happens, we'll create *branch-4.0* on demand. For me, a branch called *branch-4.0* doesn't imply either that I can expect cool releases in the future from there or that the branch is maintained and tries to be in sync with *master*.
> >
> > Regards,
> > Laszlo Bodor
> >
> > Alessandro Solimando wrote (on Tue, Feb 8, 2022, 16:42):
> >
> >> Hello everyone,
> >> thank you for starting this discussion.
> >>
> >> I agree that releasing the master branch regularly and sufficiently often is welcome and vital for the health of the community.
> >>
> >> It would be great to hear from others too, especially PMC members and committers, but even simple contributors/followers like myself.
> >>
> >> Best regards,
> >> Alessandro
> >>
> >> On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis wrote:
> >>
> >>> Hello,
> >>>
> >>> Thanks for starting the discussion Zoltan.
> >>>
> >>> I strongly believe that it is important to have regular and frequent releases, otherwise people will create and maintain separate Hive forks.
> >>> The latter is not good for the project and the community may lose valuable members because of it.
> >>>
> >>> Going forward I fully agree that there
[jira] [Created] (HIVE-25946) select from external table pointing to MySQL returns multiple copies of the same row
Witold Drabicki created HIVE-25946:

Summary: select from external table pointing to MySQL returns multiple copies of the same row
Key: HIVE-25946
URL: https://issues.apache.org/jira/browse/HIVE-25946
Project: Hive
Issue Type: Bug
Components: Hive
Affects Versions: 2.3.7
Environment: Hive runs on *GCP Dataproc,* image version is *1.5.56-debian10* (Hive v {*}2.3.7{*})
*MySQL* server version is {*}5.7.36{*}.

The following jars are used:
{code:java}
add jar gs://d-test-bucket-1/commons-pool-1.6.jar;
add jar gs://d-test-bucket-1/hive-jdbc-handler-2.3.7.jar;
add jar gs://d-test-bucket-1/commons-dbcp-1.4.jar;
add jar gs://d-test-bucket-1/mysql-connector-java-8.0.27.jar; (identical behavior when using mysql-connector-java-5.1.49){code}
Reporter: Witold Drabicki

The following table has been created in Hive:
{code:java}
CREATE EXTERNAL TABLE table_with_4_rows (
  col1 varchar(100),
  col2 varchar(15),
  col3 TIMESTAMP,
  col4 TIMESTAMP
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
  "hive.sql.jdbc.url" = "jdbc:mysql:///",
  "hive.sql.dbcp.username" = "",
  "hive.sql.dbcp.password" = "",
  "hive.sql.table" = "TABLE_WITH_4_ROWS",
  "hive.sql.schema" = "schema-name",
  "hive.sql.query" = "select col1, col2, col3, col4 from schema-name.TABLE_WITH_4_ROWS",
  "hive.sql.numPartitions" = "1",
  "hive.sql.dbcp.maxActive" = "1"
);{code}
The table in MySQL has just 4 rows, and is defined as:
{code:java}
CREATE TABLE `TABLE_WITH_4_ROWS` (
  `col1` varchar(100) NOT NULL DEFAULT '',
  `col2` varchar(15) DEFAULT NULL,
  `col3` datetime DEFAULT NULL,
  `col4` datetime DEFAULT NULL,
  PRIMARY KEY (`col1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;{code}
When the cluster is *not 100% busy* and has idle containers, running *select col1, col2 from table_with_4_rows* results in a job that uses 49 mappers and no reducers, and returns 187 rows instead of 4 (each original row is duplicated multiple times in the results).

Running the same select but with *WHERE col1 = 'specific-value'* also uses 49 mappers and, instead of returning 1 row, also returns duplicated data (46 to 48 rows, depending on the value).

When the cluster is *100% busy* and the job needs to reclaim containers from other jobs, the above queries use just 1 mapper and *return correct data* (4 rows and 1 row, respectively).

Running *ANALYZE TABLE table_with_4_rows COMPUTE STATISTICS* does not change the results; however, it also works incorrectly, as it sets +numRows+ in the table's metadata to 187 as well.

There's an *ArrayIndexOutOfBoundsException* *Error during condition build* exception thrown during the query execution. Here's the output from the *log* file:
{code:java}
2022-02-08 20:43:39 Running Dag: dag_1644267138354_0004_1
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
Continuing ...
2022-02-08 20:44:03 Completed Dag: dag_1644267138354_0004_1
2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set #tasks for the vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex will initialize from input initializer. vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,900 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Starting 1 inputInitializers for vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,921 [INFO] [Dispatcher thread {Central}] |Configuration.deprecation|: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
2022-02-08 20:43:39,998 [INFO] [Dispatcher thread {Central}] |conf.HiveConf|: Found configuration file null
2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] |tez.HiveSplitGenerator|: SplitGenerator using llap affinitized locations: false
2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] |tez.HiveSplitGenerator|: SplitLocationProvider: org.apache.hadoop.hive.ql.exec.tez.Utils$1@565d6567
2022-02-08 20:43:40,115 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: PLAN PATH = hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
2022-02-08 20:43:40,125 [INFO] [Dispatcher thread {Central}] |exec.SerializationUtilities|: Deserializing MapWork using kryo
2022-02-08 20:43:40,267 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
2022-02-08 20:43:40,275 [INFO] [InputInitializer {Map 1} #0]
[jira] [Created] (HIVE-25945) Upgrade H2 database version to 2.1.210
Stamatis Zampetakis created HIVE-25945:

Summary: Upgrade H2 database version to 2.1.210
Key: HIVE-25945
URL: https://issues.apache.org/jira/browse/HIVE-25945
Project: Hive
Issue Type: Task
Components: Testing Infrastructure
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis

The 1.3.166 version, which is in use in Hive, suffers from the following security vulnerabilities:
https://nvd.nist.gov/vuln/detail/CVE-2021-42392
https://nvd.nist.gov/vuln/detail/CVE-2022-23221

In the project, we use H2 only for testing purposes (inside the jdbc-handler module), so the H2 binaries are not present in the runtime classpath and these CVEs do not pose a problem for Hive or its users. Nevertheless, it would be good to upgrade to a more recent version to avoid Hive coming up in vulnerability scans due to this.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (HIVE-25944) Format pom.xml-s
Zoltan Haindrich created HIVE-25944:

Summary: Format pom.xml-s
Key: HIVE-25944
URL: https://issues.apache.org/jira/browse/HIVE-25944
Project: Hive
Issue Type: Improvement
Reporter: Zoltan Haindrich
Assignee: Zoltan Haindrich

At the moment, whenever I touch the pom.xml-s with xmlstarlet it starts fixing indentation, which makes the real diffs harder to see.

Fix the indentation of the pom.xml-s and enforce that they stay indented correctly.
[jira] [Created] (HIVE-25943) Introduce compaction cleaner failed attempts threshold
László Végh created HIVE-25943:

Summary: Introduce compaction cleaner failed attempts threshold
Key: HIVE-25943
URL: https://issues.apache.org/jira/browse/HIVE-25943
Project: Hive
Issue Type: Improvement
Components: Hive
Reporter: László Végh

If the cleaner fails for some reason, the compaction entity status remains "ready for cleaning", so the cleaner will pick up the same entity again, resulting in endless retries. The number of failed cleaning attempts should be counted, and once it reaches a certain threshold the cleaner must skip all further cleaning attempts on that compaction entity.
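The proposed behaviour can be sketched roughly as follows. This is only an illustration in Python, not Hive code: `CompactionEntry`, `MAX_FAILED_ATTEMPTS`, `run_cleaner_cycle` and `always_fails` are hypothetical names, and the threshold value is an assumption rather than an existing Hive setting.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_FAILED_ATTEMPTS = 3  # hypothetical threshold, not a real Hive config key


@dataclass
class CompactionEntry:
    """Hypothetical stand-in for an entry in the compaction queue."""
    entry_id: int
    state: str = "ready for cleaning"
    failed_attempts: int = 0


def run_cleaner_cycle(queue: List[CompactionEntry],
                      clean: Callable[[CompactionEntry], None]) -> None:
    """One cleaner pass: try each eligible entry and count failures."""
    for entry in queue:
        if entry.state != "ready for cleaning":
            continue
        if entry.failed_attempts >= MAX_FAILED_ATTEMPTS:
            # Threshold reached: skip the entry instead of retrying forever.
            continue
        try:
            clean(entry)
            entry.state = "succeeded"
        except Exception:
            # The state stays "ready for cleaning"; only the counter changes.
            entry.failed_attempts += 1


def always_fails(entry: CompactionEntry) -> None:
    """Simulated cleaning step that never succeeds."""
    raise RuntimeError("simulated cleaner failure")


queue = [CompactionEntry(entry_id=1)]
for _ in range(10):
    run_cleaner_cycle(queue, always_fails)
print(queue[0].failed_attempts)  # stops growing once the threshold is hit
```

Without the counter check, each of the ten cycles would attempt the same entry again; with it, the entry is attempted only until the threshold is reached and is skipped afterwards.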
Re: Start releasing the master branch
Hey,

Thank you guys for chiming in; versioning is for sure something we should get to some common ground.
It's a triple problem right now; I think we have the following things:
* storage-api
** we have "2.7.3-SNAPSHOT" in the repo
*** https://github.com/apache/hive/blob/0d1cc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
** meanwhile we already have 2.8.1 released to maven central
*** https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
* standalone-metastore
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2
* hive
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2

Regarding the actual version number I'm not entirely sure where we should start the numbering - that's why I was referring to it as Hive-X in my first letter.

I think the key point here would be to start shipping releases regularly and not the actual version number we will use - I'm kinda open to any versioning scheme which reflects that this is a newer release than 3.1.2.

I could imagine the following ones:
(A) start with something less expected; but keep 3 in the prefix to reflect that this is not yet 4.0
    I can imagine the following numbers:
    3.900.0, 3.901.0, ...
    3.9.0, 3.9.1, ...
(B) start 4.0.0
    4.0.0, 4.1.0, ...
(C) jump to some calendar based version number like 2022.2.9
    trunk based development has pros and cons...making a move like this irreversibly pledges trunk based development; and makes release branches hard to introduce
(X) somewhat orthogonal is to (also) use some suffixes
    4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
    this is probably the most tempting to use - but this versioning scheme with a non-changing MINOR and PATCH number will also suggest that the actual software is fully compatible - and only bugs are being fixed - which will not be true...

I really like the idea to suffix these releases with alpha or beta - which will communicate our level of commitment: that these are not 100% production ready artifacts.
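To see how the candidate numbers would sort relative to 3.1.2, here is a rough sketch in Python. It is a simplified ordering written for illustration only: `version_key` and `QUALIFIER_RANK` are hypothetical names, the qualifier ranks (alpha before beta before SNAPSHOT before the plain release) are an assumption loosely modelled on how Maven orders qualifiers, and this is not Maven's actual comparison algorithm.

```python
import re
from typing import Tuple

# Assumed ranks: a qualified version sorts before the plain release.
QUALIFIER_RANK = {"alpha": 0, "beta": 1, "snapshot": 2, "": 3}

# MAJOR.MINOR.PATCH with an optional "-qualifier", "-qualifierN" or "-qualifier-N"
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([A-Za-z]+)-?(\d*))?$")


def version_key(version: str) -> Tuple[int, int, int, int, int]:
    """Turn a version string into a tuple that sorts in release order."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    qualifier = (match.group(4) or "").lower()
    if qualifier not in QUALIFIER_RANK:
        raise ValueError(f"unsupported qualifier: {qualifier!r}")
    number = int(match.group(5)) if match.group(5) else 0
    return (major, minor, patch, QUALIFIER_RANK[qualifier], number)


candidates = ["4.0.0", "3.1.2", "4.0.0-alpha2", "3.900.0",
              "4.0.0-beta1", "4.0.0-alpha1", "2022.2.9"]
for version in sorted(candidates, key=version_key):
    print(version)
```

Under this ordering every scheme discussed above yields something newer than 3.1.2, the alpha and beta releases sort before a final 4.0.0, and a calendar number such as 2022.2.9 sorts above any 4.x version.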
I think we could fix HIVE-25665; and probably experiment with 4.0.0-alpha1 for start...

> This also means there should *not* be a branch-4 after releasing Hive 4.0
> and let that diverge (and becomes the next, super-ignored branch-3),
correct; no need to keep a branch we don't maintain...but in any case I think we can postpone this decision until there will be something to release... :)

cheers,
Zoltan

On 2/9/22 10:23 AM, László Bodor wrote:

Hi All!

A purely technical question: what will the SNAPSHOT version become after releasing Hive 4.0.0? I think this is important, as it defines and reflects the future release plans.

Currently, it's 4.0.0-SNAPSHOT, I guess it has been since Hive 3.0 + branch-3.
Hive is an evolving and super-active project: if we want to make regular releases, we should simply release Hive 4.0 and bump pom to 4.1.0-SNAPSHOT, which clearly says that we can release Hive 4.1 anytime we want, without being frustrated about "whether we included enough cool stuff to release 5.0".

This also means there should *not* be a branch-4 after releasing Hive 4.0 and let that diverge (and becomes the next, super-ignored branch-3), only when we end up bringing a minor backward-incompatible thing that needs a 4.0.x, and when it happens, we'll create *branch-4.0* on demand. For me, a branch called *branch-4.0* doesn't imply either that I can expect cool releases in the future from there or that the branch is maintained and tries to be in sync with *master*.

Regards,
Laszlo Bodor

Alessandro Solimando wrote (on Tue, Feb 8, 2022, 16:42):

Hello everyone,
thank you for starting this discussion.

I agree that releasing the master branch regularly and sufficiently often is welcome and vital for the health of the community.

It would be great to hear from others too, especially PMC members and committers, but even simple contributors/followers like myself.

Best regards,
Alessandro

On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis wrote:

Hello,

Thanks for starting the discussion Zoltan.

I strongly believe that it is important to have regular and frequent releases, otherwise people will create and maintain separate Hive forks.
The latter is not good for the project and the community may lose valuable members because of it.

Going forward I fully agree that there is no point bringing up strong blockers for the next release. For sure there are many backward incompatible changes and possibly unstable features but unless we get a release out it will be difficult to determine what is broken and what needs to be fixed.

Due to the large number of changes that are going to appear in the next version I would suggest using the terms Hive X-alpha, Hive X-beta for the first few releases. This will make it clear to the end users that they need to be careful when upgrading from an older version and it will give us a bit more time and freedom to treat issues that the users will likely discover.

The only real blocker that we may want to treat is
[jira] [Created] (HIVE-25942) Upgrade commons-io to 2.8.0 due to CVE-2021-29425
Syed Shameerur Rahman created HIVE-25942:

Summary: Upgrade commons-io to 2.8.0 due to CVE-2021-29425
Key: HIVE-25942
URL: https://issues.apache.org/jira/browse/HIVE-25942
Project: Hive
Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

All commons-io versions below 2.7 are affected by [CVE-2021-29425|https://nvd.nist.gov/vuln/detail/CVE-2021-29425].

Tez and Hadoop have upgraded commons-io to 2.8.0 in [TEZ-4353|https://issues.apache.org/jira/browse/TEZ-4353] and [HADOOP-17683|https://issues.apache.org/jira/browse/HADOOP-17683] respectively, and it would be good for Hive to do the same.
Re: Start releasing the master branch
Hi All!

A purely technical question: what will the SNAPSHOT version become after releasing Hive 4.0.0? I think this is important, as it defines and reflects the future release plans.

Currently, it's 4.0.0-SNAPSHOT, I guess it has been since Hive 3.0 + branch-3.
Hive is an evolving and super-active project: if we want to make regular releases, we should simply release Hive 4.0 and bump pom to 4.1.0-SNAPSHOT, which clearly says that we can release Hive 4.1 anytime we want, without being frustrated about "whether we included enough cool stuff to release 5.0".

This also means there should *not* be a branch-4 after releasing Hive 4.0 and let that diverge (and becomes the next, super-ignored branch-3), only when we end up bringing a minor backward-incompatible thing that needs a 4.0.x, and when it happens, we'll create *branch-4.0* on demand. For me, a branch called *branch-4.0* doesn't imply either that I can expect cool releases in the future from there or that the branch is maintained and tries to be in sync with *master*.

Regards,
Laszlo Bodor

Alessandro Solimando wrote (on Tue, Feb 8, 2022, 16:42):

> Hello everyone,
> thank you for starting this discussion.
>
> I agree that releasing the master branch regularly and sufficiently often is welcome and vital for the health of the community.
>
> It would be great to hear from others too, especially PMC members and committers, but even simple contributors/followers like myself.
>
> Best regards,
> Alessandro
>
> On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis wrote:
>
> > Hello,
> >
> > Thanks for starting the discussion Zoltan.
> >
> > I strongly believe that it is important to have regular and frequent releases, otherwise people will create and maintain separate Hive forks.
> > The latter is not good for the project and the community may lose valuable members because of it.
> >
> > Going forward I fully agree that there is no point bringing up strong blockers for the next release. For sure there are many backward incompatible changes and possibly unstable features but unless we get a release out it will be difficult to determine what is broken and what needs to be fixed.
> >
> > Due to the large number of changes that are going to appear in the next version I would suggest using the terms Hive X-alpha, Hive X-beta for the first few releases. This will make it clear to the end users that they need to be careful when upgrading from an older version and it will give us a bit more time and freedom to treat issues that the users will likely discover.
> >
> > The only real blocker that we may want to treat is HIVE-25665 [1] but we can continue the discussion under that ticket and re-evaluate if necessary,
> >
> > Best,
> > Stamatis
> >
> > [1] https://issues.apache.org/jira/browse/HIVE-25665
> >
> > On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich wrote:
> >
> > > Hey All,
> > >
> > > We haven't made a release for a long time now (3.1.2 was released on 26 August 2019) - and I think because we didn't make that many branch-3 releases, not too many fixes were ported there - which made that release branch kinda erode away.
> > >
> > > We have a lot of new features/changes in the current master.
> > > I think instead of aiming for big feature-packed releases we should aim for making a regular release every few months - we should make regular releases which people could install and use.
> > > After all, releasing Hive after more than 2 years would be a big step forward in itself - we have so many improvements that I can't even count...
> > >
> > > But I may not know every aspect of the project / states of some internal features - so I would like to ask you:
> > > What would be the bare minimum requirements before we could release the current master as Hive X?
> > >
> > > There are many nice-to-have-s like:
> > > * hadoop upgrade
> > > * jdk11
> > > * remove HoS or MR
> > > * ?
> > > but I don't think these are blockers...we can make any of these in the next release if we start making them...
> > >
> > > cheers,
> > > Zoltan
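The bump László describes at the top of this message (release 4.0.0, then move the pom to 4.1.0-SNAPSHOT) amounts to incrementing the minor number after each release. A tiny sketch in Python, where `next_snapshot` is a hypothetical helper and not part of any Hive or Maven tooling:

```python
def next_snapshot(released: str) -> str:
    """Return the next minor development version for a released X.Y.Z."""
    major, minor, _patch = (int(part) for part in released.split("."))
    return f"{major}.{minor + 1}.0-SNAPSHOT"


print(next_snapshot("4.0.0"))  # 4.1.0-SNAPSHOT
```

With this rule master always points at the next minor release, so a 4.1 can be cut at any time without first deciding whether the changes justify a new major version.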
[jira] [Created] (HIVE-25941) Long compilation time of complex query due to analysis for materialized view rewrite
Krisztian Kasa created HIVE-25941:

Summary: Long compilation time of complex query due to analysis for materialized view rewrite
Key: HIVE-25941
URL: https://issues.apache.org/jira/browse/HIVE-25941
Project: Hive
Issue Type: Bug
Components: Materialized views
Reporter: Krisztian Kasa
Assignee: Krisztian Kasa

When compiling a query, the optimizer tries to rewrite the query plan, or subtrees of the plan, to use materialized view scans.

With
{code}
set hive.materializedview.rewriting.sql.subquery=false;
{code}
the compilation succeeds in less than 10 seconds; otherwise it takes several minutes (~5 min), depending on the hardware.