Re: Start releasing the master branch

2022-02-09 Thread Szehon Ho
+1, it would be awesome to see Hive master released after so long.

Either 4.0 or 4.0.0-alpha-1 makes sense to me; I'm not sure how we would pick
any 3.x or calendar-date version (which could tend to slip and be more confusing?).

Thanks in any case for getting the ball rolling.
Szehon

On Wed, Feb 9, 2022 at 4:55 AM Zoltan Haindrich wrote:

> Hey,
>
> Thank you guys for chiming in; versioning is for sure something we should
> find some common ground on.
> It's a triple problem right now; I think we have the following things:
> * storage-api
> ** we have "2.7.3-SNAPSHOT" in the repo
> *** https://github.com/apache/hive/blob/0d1cc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
> ** meanwhile we already have 2.8.1 released to maven central
> *** https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
> * standalone-metastore
> ** 4.0.0-SNAPSHOT in the repo
> ** last release is 3.1.2
> * hive
> ** 4.0.0-SNAPSHOT in the repo
> ** last release is 3.1.2
>
> Regarding the actual version number I'm not entirely sure where we should
> start the numbering - that's why I was referring to it as Hive-X in my
> first letter.
>
> I think the key point here is to start shipping releases regularly,
> not the actual version number we use - I'm kinda open to any versioning
> scheme which reflects that this is a newer release than 3.1.2.
>
> I could imagine the following ones:
> (A) start with something less expected; but keep 3 in the prefix to
> reflect that this is not yet 4.0
>  I can imagine the following numbers:
>  3.900.0, 3.901.0, ...
>  3.9.0, 3.9.1, ...
> (B) start 4.0.0
>  4.0.0, 4.1.0, ...
> (C) jump to some calendar based version number like 2022.2.9
>  trunk based development has pros and cons...making a move like this
> irreversibly pledges trunk based development; and makes release branches
> hard to introduce
> (X) somewhat orthogonal is to (also) use some suffixes
>  4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
>  this is probably the most tempting to use - but this versioning
> scheme with a non-changing MINOR and PATCH number will
>  also suggest that the actual software is fully compatible - and only
> bugs are being fixed - which will not be true...
>
> I really like the idea of suffixing these releases with alpha or beta -
> which will communicate our level of commitment, namely that these are not
> 100% production-ready artifacts.
>
> I think we could fix HIVE-25665; and probably experiment with 4.0.0-alpha1
> for a start...
>
>  > This also means there should *not* be a branch-4 after releasing Hive 4.0
>  > that we let diverge (and become the next, super-ignored branch-3).
> correct; no need to keep a branch we don't maintain...but in any case I
> think we can postpone this decision until there is something to
> release... :)
>
> cheers,
> Zoltan
>
>
>
> On 2/9/22 10:23 AM, László Bodor wrote:
> > Hi All!
> >
> > A purely technical question: what will the SNAPSHOT version become after
> > releasing Hive 4.0.0? I think this is important, as it defines and
> > reflects the future release plans.
> >
> > Currently, it's 4.0.0-SNAPSHOT; I guess it has been since Hive 3.0 +
> > branch-3. Hive is an evolving and super-active project: if we want to
> > make regular releases, we should simply release Hive 4.0 and bump the pom
> > to 4.1.0-SNAPSHOT, which clearly says that we can release Hive 4.1
> > anytime we want, without being frustrated about "whether we included
> > enough cool stuff to release 5.0".
> >
> > This also means there should *not* be a branch-4 after releasing Hive 4.0
> > that we let diverge (and become the next, super-ignored branch-3). Only
> > when we end up bringing in a minor backward-incompatible change that
> > needs a 4.0.x release will we create *branch-4.0* on demand. For me, a
> > branch called *branch-4.0* implies neither that I can expect cool
> > releases from there in the future nor that the branch is maintained and
> > kept in sync with *master*.
> >
> > Regards,
> > Laszlo Bodor
> >
> > On Tue, Feb 8, 2022 at 16:42, Alessandro Solimando wrote:
> >
> >> Hello everyone,
> >> thank you for starting this discussion.
> >>
> >> I agree that releasing the master branch regularly and sufficiently
> >> often is welcome and vital for the health of the community.
> >>
> >> It would be great to hear from others too, especially PMC members and
> >> committers, but even simple contributors/followers like myself.
> >>
> >> Best regards,
> >> Alessandro
> >>
> >> On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis wrote:
> >>
> >>> Hello,
> >>>
> >>> Thanks for starting the discussion Zoltan.
> >>>
> >>> I strongly believe that it is important to have regular and frequent
> >>> releases; otherwise people will create and maintain separate Hive forks.
> >>> The latter is not good for the project, and the community may lose
> >>> valuable members because of it.
> >>>
> >>> Going forward I fully agree that there 

[jira] [Created] (HIVE-25946) select from external table pointing to MySQL returns multiple copies of the same row

2022-02-09 Thread Witold Drabicki (Jira)
Witold Drabicki created HIVE-25946:
--

 Summary: select from external table pointing to MySQL returns multiple copies of the same row
 Key: HIVE-25946
 URL: https://issues.apache.org/jira/browse/HIVE-25946
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 2.3.7
 Environment: Hive runs on *GCP Dataproc*, image version *1.5.56-debian10* (Hive v *2.3.7*)

*MySQL* server version is *5.7.36*.

The following jars are used:
{code:java}
add jar gs://d-test-bucket-1/commons-pool-1.6.jar;
add jar gs://d-test-bucket-1/hive-jdbc-handler-2.3.7.jar;
add jar gs://d-test-bucket-1/commons-dbcp-1.4.jar;
add jar gs://d-test-bucket-1/mysql-connector-java-8.0.27.jar;
-- identical behavior when using mysql-connector-java-5.1.49
{code}
Reporter: Witold Drabicki


The following table has been created in Hive:

 
{code:java}
CREATE EXTERNAL TABLE table_with_4_rows
(
  col1 varchar(100),
  col2 varchar(15),
  col3 TIMESTAMP,    
  col4 TIMESTAMP
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
    "hive.sql.database.type" = "MYSQL",
    "hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
    "hive.sql.jdbc.url" = "jdbc:mysql:///",
    "hive.sql.dbcp.username" = "",
    "hive.sql.dbcp.password" = "",
    "hive.sql.table" = "TABLE_WITH_4_ROWS",
    "hive.sql.schema" = "schema-name",
    "hive.sql.query" = "select col1, col2, col3, col4 from schema-name.TABLE_WITH_4_ROWS",
    "hive.sql.numPartitions" = "1",
    "hive.sql.dbcp.maxActive" = "1"
);{code}
 

The table in MySQL has just 4 rows, and is defined as:

 
{code:java}
CREATE TABLE `TABLE_WITH_4_ROWS` (
  `col1` varchar(100) NOT NULL DEFAULT '',
  `col2` varchar(15) DEFAULT NULL,
  `col3` datetime DEFAULT NULL,
  `col4` datetime DEFAULT NULL,
  PRIMARY KEY (`col1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;{code}
 

When the cluster is *not 100% busy* and has idle containers, running *select col1, 
col2 from table_with_4_rows* results in a job that uses 49 mappers and no 
reducers, and returns 187 rows instead of 4 (each original row is duplicated 
multiple times in the results).

Running the same select with *WHERE col1 = 'specific-value'* also uses 49 
mappers and, instead of returning 1 row, returns duplicated data (46 to 48 
rows, depending on the value).

When the cluster is *100% busy* and the job needs to reclaim containers from other 
jobs, the above queries use just 1 mapper and *return correct data* (4 rows and 1 
row, respectively).

Running *ANALYZE TABLE table_with_4_rows COMPUTE STATISTICS* does not change 
the results; however, it also works incorrectly, as it sets +numRows+ in the 
table's metadata to 187 as well.

An *ArrayIndexOutOfBoundsException* (*Error during condition build*) is thrown 
during query execution. Here's the output from the *log* file:

 
{code:java}
2022-02-08 20:43:39 Running Dag: dag_1644267138354_0004_1
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
Continuing ...
2022-02-08 20:44:03 Completed Dag: dag_1644267138354_0004_1
2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set #tasks for the vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex will initialize from input initializer. vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,900 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Starting 1 inputInitializers for vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,921 [INFO] [Dispatcher thread {Central}] |Configuration.deprecation|: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
2022-02-08 20:43:39,998 [INFO] [Dispatcher thread {Central}] |conf.HiveConf|: Found configuration file null
2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] |tez.HiveSplitGenerator|: SplitGenerator using llap affinitized locations: false
2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] |tez.HiveSplitGenerator|: SplitLocationProvider: org.apache.hadoop.hive.ql.exec.tez.Utils$1@565d6567
2022-02-08 20:43:40,115 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: PLAN PATH = hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
2022-02-08 20:43:40,125 [INFO] [Dispatcher thread {Central}] |exec.SerializationUtilities|: Deserializing MapWork using kryo
2022-02-08 20:43:40,267 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
2022-02-08 20:43:40,275 [INFO] [InputInitializer {Map 1} #0] 

[jira] [Created] (HIVE-25945) Upgrade H2 database version to 2.1.210

2022-02-09 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-25945:
--

 Summary: Upgrade H2 database version to 2.1.210
 Key: HIVE-25945
 URL: https://issues.apache.org/jira/browse/HIVE-25945
 Project: Hive
  Issue Type: Task
  Components: Testing Infrastructure
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Hive currently uses H2 version 1.3.166, which suffers from the following 
security vulnerabilities:
https://nvd.nist.gov/vuln/detail/CVE-2021-42392
https://nvd.nist.gov/vuln/detail/CVE-2022-23221

In the project, we use H2 only for testing purposes (inside the jdbc-handler 
module), so the H2 binaries are not present in the runtime classpath and 
these CVEs do not pose a problem for Hive or its users. Nevertheless, it would 
be good to upgrade to a more recent version to keep Hive from showing up in 
vulnerability scans because of it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HIVE-25944) Format pom.xml-s

2022-02-09 Thread Zoltan Haindrich (Jira)
Zoltan Haindrich created HIVE-25944:
---

 Summary: Format pom.xml-s
 Key: HIVE-25944
 URL: https://issues.apache.org/jira/browse/HIVE-25944
 Project: Hive
  Issue Type: Improvement
Reporter: Zoltan Haindrich
Assignee: Zoltan Haindrich


The moment I touch the pom.xml files with xmlstarlet, it starts fixing the 
indentation, which makes real diffs harder to see.

Fix the indentation of the pom.xml files and enforce that it stays correct.





[jira] [Created] (HIVE-25943) Introduce compaction cleaner failed attempts threshold

2022-02-09 Thread László Végh (Jira)
László Végh created HIVE-25943:
--

 Summary: Introduce compaction cleaner failed attempts threshold
 Key: HIVE-25943
 URL: https://issues.apache.org/jira/browse/HIVE-25943
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Reporter: László Végh


If the cleaner fails for some reason, the compaction entity remains in the 
"ready for cleaning" state, so the cleaner picks up the same entity again, 
resulting in endless retries. The number of failed cleaning attempts should be 
counted, and once it reaches a certain threshold the cleaner must skip all 
further cleaning attempts on that compaction entity.
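
The proposed behavior can be sketched roughly as follows. This is only an illustrative Python model, not Hive's cleaner code: the `CompactionEntry` class, the `run_cleaner_cycle` function, the `failed_attempts` counter and the threshold value of 3 are all hypothetical names and values chosen for the sketch.

```python
from dataclasses import dataclass

# Hypothetical threshold; the issue does not propose a concrete value.
MAX_FAILED_ATTEMPTS = 3


@dataclass
class CompactionEntry:
    """Hypothetical stand-in for a compaction entity awaiting cleanup."""
    name: str
    state: str = "ready for cleaning"
    failed_attempts: int = 0


def run_cleaner_cycle(entries, clean, max_failed_attempts=MAX_FAILED_ATTEMPTS):
    """Try to clean each eligible entry once.

    Returns the names of entries that were skipped because they already
    reached the failed-attempts threshold.
    """
    skipped = []
    for entry in entries:
        if entry.state != "ready for cleaning":
            continue
        if entry.failed_attempts >= max_failed_attempts:
            # Threshold reached: do not retry this entry any more.
            skipped.append(entry.name)
            continue
        try:
            clean(entry)
        except Exception:
            # The state stays "ready for cleaning"; only the counter changes.
            entry.failed_attempts += 1
        else:
            entry.state = "cleaned"
    return skipped
```

With a `clean` callback that always raises, an entry is retried in three cycles and skipped from the fourth cycle onwards, instead of being picked up forever.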





Re: Start releasing the master branch

2022-02-09 Thread Zoltan Haindrich

Hey,

Thank you guys for chiming in; versioning is for sure something we should find 
some common ground on.
It's a triple problem right now; I think we have the following things:
* storage-api
** we have "2.7.3-SNAPSHOT" in the repo
*** https://github.com/apache/hive/blob/0d1cc7c5005fe47759298fb35a1c67edc93f/storage-api/pom.xml#L27
** meanwhile we already have 2.8.1 released to maven central
*** https://mvnrepository.com/artifact/org.apache.hive/hive-storage-api
* standalone-metastore
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2
* hive
** 4.0.0-SNAPSHOT in the repo
** last release is 3.1.2

Regarding the actual version number I'm not entirely sure where we should start 
the numbering - that's why I was referring to it as Hive-X in my first letter.

I think the key point here is to start shipping releases regularly, not the 
actual version number we use - I'm kinda open to any versioning scheme which 
reflects that this is a newer release than 3.1.2.


I could imagine the following ones:
(A) start with something less expected; but keep 3 in the prefix to reflect 
that this is not yet 4.0
    I can imagine the following numbers:
    3.900.0, 3.901.0, ...
    3.9.0, 3.9.1, ...
(B) start 4.0.0
    4.0.0, 4.1.0, ...
(C) jump to some calendar-based version number like 2022.2.9
    trunk based development has pros and cons...making a move like this 
    irreversibly pledges trunk based development; and makes release branches 
    hard to introduce
(X) somewhat orthogonal is to (also) use some suffixes
    4.0.0-alpha1, 4.0.0-alpha2, 4.0.0-beta1
    this is probably the most tempting to use - but this versioning scheme 
    with a non-changing MINOR and PATCH number will also suggest that the 
    actual software is fully compatible - and only bugs are being fixed - 
    which will not be true...
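
To make the ordering implied by these options concrete, here is a small Python sketch. It is a simplified comparison written for illustration, not Maven's actual version-ordering algorithm; the `version_key` helper and the qualifier ranking (alpha < beta < rc < snapshot < plain release) are assumptions of this sketch.

```python
import re

# Assumed ranking of qualifiers; a plain release (no qualifier) sorts last.
QUALIFIER_RANK = {"alpha": 0, "beta": 1, "rc": 2, "snapshot": 3, "": 4}


def version_key(version):
    """Return a sortable key for versions such as '4.0.0-alpha1'."""
    numbers, _, qualifier = version.partition("-")
    match = re.fullmatch(r"([a-z]*)-?(\d*)", qualifier.lower())
    if match is None or match.group(1) not in QUALIFIER_RANK:
        raise ValueError(f"unsupported qualifier in {version!r}")
    name, counter = match.groups()
    return (
        tuple(int(part) for part in numbers.split(".")),
        QUALIFIER_RANK[name],
        int(counter or 0),
    )


if __name__ == "__main__":
    last_release = "3.1.2"
    candidates = ["3.900.0", "3.9.0", "4.0.0", "2022.2.9", "4.0.0-alpha1"]
    for candidate in candidates:
        newer = version_key(candidate) > version_key(last_release)
        print(candidate, "is newer than", last_release, ":", newer)
    print(sorted(["4.0.0", "4.0.0-beta1", "4.0.0-alpha2", "4.0.0-alpha1"],
                 key=version_key))
```

Under these assumptions every scheme listed above sorts after 3.1.2, and the suffixed versions sort before the plain 4.0.0, so the choice between them is about what the number communicates rather than about ordering.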

I really like the idea of suffixing these releases with alpha or beta - which 
will communicate our level of commitment, namely that these are not 100% 
production-ready artifacts.

I think we could fix HIVE-25665; and probably experiment with 4.0.0-alpha1 for 
a start...

> This also means there should *not* be a branch-4 after releasing Hive 4.0
> that we let diverge (and become the next, super-ignored branch-3).
correct; no need to keep a branch we don't maintain...but in any case I think 
we can postpone this decision until there is something to release... :)

cheers,
Zoltan



On 2/9/22 10:23 AM, László Bodor wrote:

Hi All!

A purely technical question: what will the SNAPSHOT version become after
releasing Hive 4.0.0? I think this is important, as it defines and reflects
the future release plans.

Currently, it's 4.0.0-SNAPSHOT; I guess it has been since Hive 3.0 + branch-3.
Hive is an evolving and super-active project: if we want to make regular
releases, we should simply release Hive 4.0 and bump the pom to 4.1.0-SNAPSHOT,
which clearly says that we can release Hive 4.1 anytime we want, without
being frustrated about "whether we included enough cool stuff to release
5.0".

This also means there should *not* be a branch-4 after releasing Hive 4.0
that we let diverge (and become the next, super-ignored branch-3). Only
when we end up bringing in a minor backward-incompatible change that needs a
4.0.x release will we create *branch-4.0* on demand. For me, a
branch called *branch-4.0* implies neither that I can expect cool releases
from there in the future nor that the branch is maintained and kept in
sync with *master*.

Regards,
Laszlo Bodor

On Tue, Feb 8, 2022 at 16:42, Alessandro Solimando wrote:


Hello everyone,
thank you for starting this discussion.

I agree that releasing the master branch regularly and sufficiently often
is welcome and vital for the health of the community.

It would be great to hear from others too, especially PMC members and
committers, but even simple contributors/followers like myself.

Best regards,
Alessandro

On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis wrote:


Hello,

Thanks for starting the discussion Zoltan.

I strongly believe that it is important to have regular and frequent
releases; otherwise people will create and maintain separate Hive forks.
The latter is not good for the project, and the community may lose
valuable members because of it.

Going forward I fully agree that there is no point bringing up strong
blockers for the next release. For sure there are many backward
incompatible changes and possibly unstable features but unless we get a
release out it will be difficult to determine what is broken and what
needs to be fixed.

Due to the large number of changes that are going to appear in the next
version I would suggest using the terms Hive X-alpha, Hive X-beta for the
first few releases. This will make it clear to the end users that they
need to be careful when upgrading from an older version and it will give us a
bit more time and freedom to treat issues that the users will likely
discover.

The only real blocker that we may want to treat is 

[jira] [Created] (HIVE-25942) Upgrade commons-io to 2.8.0 due to CVE-2021-29425

2022-02-09 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created HIVE-25942:


 Summary: Upgrade commons-io to 2.8.0 due to CVE-2021-29425
 Key: HIVE-25942
 URL: https://issues.apache.org/jira/browse/HIVE-25942
 Project: Hive
  Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
 Fix For: 4.0.0


All commons-io versions below 2.7 are affected by 
[CVE-2021-29425|https://nvd.nist.gov/vuln/detail/CVE-2021-29425].

Tez and Hadoop have upgraded commons-io to 2.8.0 in 
[TEZ-4353|https://issues.apache.org/jira/browse/TEZ-4353] and 
[HADOOP-17683|https://issues.apache.org/jira/browse/HADOOP-17683] respectively, 
and it would be good if Hive did the same.





Re: Start releasing the master branch

2022-02-09 Thread László Bodor
Hi All!

A purely technical question: what will the SNAPSHOT version become after
releasing Hive 4.0.0? I think this is important, as it defines and reflects
the future release plans.

Currently, it's 4.0.0-SNAPSHOT; I guess it has been since Hive 3.0 + branch-3.
Hive is an evolving and super-active project: if we want to make regular
releases, we should simply release Hive 4.0 and bump the pom to 4.1.0-SNAPSHOT,
which clearly says that we can release Hive 4.1 anytime we want, without
being frustrated about "whether we included enough cool stuff to release
5.0".
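
As a rough illustration of that bump rule (release X.Y.Z, then move the development version to X.(Y+1).0-SNAPSHOT), here is a minimal Python sketch. The `next_minor_snapshot` helper is a hypothetical name used only for this example; it is not part of any Hive or Maven tooling.

```python
def next_minor_snapshot(released):
    """Return the next development version after a MAJOR.MINOR.PATCH release.

    Any qualifier such as '-alpha1' is ignored; only the numeric part is used.
    """
    numeric_part = released.partition("-")[0]
    major, minor, _patch = (int(part) for part in numeric_part.split("."))
    return f"{major}.{minor + 1}.0-SNAPSHOT"


if __name__ == "__main__":
    print(next_minor_snapshot("4.0.0"))
```

The point of the rule is that the next release is always a minor one by default, so nobody has to decide up front whether the accumulated changes justify a new major version.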

This also means there should *not* be a branch-4 after releasing Hive 4.0
that we let diverge (and become the next, super-ignored branch-3). Only
when we end up bringing in a minor backward-incompatible change that needs a
4.0.x release will we create *branch-4.0* on demand. For me, a
branch called *branch-4.0* implies neither that I can expect cool releases
from there in the future nor that the branch is maintained and kept in
sync with *master*.

Regards,
Laszlo Bodor

On Tue, Feb 8, 2022 at 16:42, Alessandro Solimando wrote:

> Hello everyone,
> thank you for starting this discussion.
>
> I agree that releasing the master branch regularly and sufficiently often
> is welcome and vital for the health of the community.
>
> It would be great to hear from others too, especially PMC members and
> committers, but even simple contributors/followers like myself.
>
> Best regards,
> Alessandro
>
> On Wed, 2 Feb 2022 at 12:22, Stamatis Zampetakis wrote:
>
> > Hello,
> >
> > Thanks for starting the discussion Zoltan.
> >
> > I strongly believe that it is important to have regular and frequent
> > releases; otherwise people will create and maintain separate Hive forks.
> > The latter is not good for the project, and the community may lose
> > valuable members because of it.
> >
> > Going forward I fully agree that there is no point bringing up strong
> > blockers for the next release. For sure there are many backward
> > incompatible changes and possibly unstable features but unless we get a
> > release out it will be difficult to determine what is broken and what
> > needs to be fixed.
> >
> > Due to the large number of changes that are going to appear in the next
> > version I would suggest using the terms Hive X-alpha, Hive X-beta for the
> > first few releases. This will make it clear to the end users that they
> > need to be careful when upgrading from an older version and it will give
> > us a bit more time and freedom to treat issues that the users will likely
> > discover.
> >
> > The only real blocker that we may want to treat is HIVE-25665 [1] but we
> > can continue the discussion under that ticket and re-evaluate if
> > necessary.
> >
> > Best,
> > Stamatis
> >
> > [1] https://issues.apache.org/jira/browse/HIVE-25665
> >
> >
> > On Tue, Feb 1, 2022 at 5:03 PM Zoltan Haindrich wrote:
> >
> > > Hey All,
> > >
> > > We haven't made a release for a long time now (3.1.2 was released on 26
> > > August 2019), and I think because we didn't make that many branch-3
> > > releases, not too many fixes were ported there - which made that
> > > release branch kinda erode away.
> > >
> > > We have a lot of new features/changes in the current master.
> > > I think instead of aiming for big feature-packed releases we should aim
> > > for making a regular release every few months - releases which people
> > > could install and use.
> > > After all, releasing Hive after more than 2 years would be a big step
> > > forward in itself - we have so many improvements that I can't even
> > > count...
> > >
> > > But I may not know every aspect of the project / the state of some
> > > internal features - so I would like to ask you:
> > > What would be the bare minimum requirements before we could release the
> > > current master as Hive X?
> > >
> > > There are many nice-to-haves like:
> > > * hadoop upgrade
> > > * jdk11
> > > * remove HoS or MR
> > > * ?
> > > but I don't think these are blockers...we can do any of these in the
> > > next release once we start making releases...
> > >
> > > cheers,
> > > Zoltan
> > >
> >
>


[jira] [Created] (HIVE-25941) Long compilation time of complex query due to analysis for materialized view rewrite

2022-02-09 Thread Krisztian Kasa (Jira)
Krisztian Kasa created HIVE-25941:
-

 Summary: Long compilation time of complex query due to analysis for materialized view rewrite
 Key: HIVE-25941
 URL: https://issues.apache.org/jira/browse/HIVE-25941
 Project: Hive
  Issue Type: Bug
  Components: Materialized views
Reporter: Krisztian Kasa
Assignee: Krisztian Kasa


When compiling a query, the optimizer tries to rewrite the query plan or subtrees 
of the plan to use materialized view scans.

If
{code}
set hive.materializedview.rewriting.sql.subquery=false;
{code}
is set, the compilation succeeds in less than 10 seconds; otherwise it takes 
several minutes (~5 min), depending on the hardware.


