[jira] [Created] (HIVE-19824) Improve online datasize estimations for MapJoins

2018-06-07 Thread Zoltan Haindrich (JIRA)
Zoltan Haindrich created HIVE-19824:
---

 Summary: Improve online datasize estimations for MapJoins
 Key: HIVE-19824
 URL: https://issues.apache.org/jira/browse/HIVE-19824
 Project: Hive
  Issue Type: Improvement
Reporter: Zoltan Haindrich
Assignee: Zoltan Haindrich


Statistics.datasize() only accounts for the "real" data size; but handling, 
for example, 1M rows might introduce some data structure overhead. If the 
"real" data is small, this overhead might dominate the actual memory usage.

For 6.5M rows of (int,int) the estimate is 52MB.
In reality this eats up ~260MB, of which 210MB is used to provide the hashmap 
functionality for that many rows.
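
The figures above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope illustration, not Hive code: the function names are hypothetical, megabytes are assumed to be decimal, and the per-row overhead is simply derived from the 210MB quoted in this issue.

```python
# Back-of-the-envelope check of the numbers quoted in this issue.
# All names here are hypothetical; nothing below is part of Hive.
ROWS = 6_500_000
BYTES_PER_INT = 4
MB = 1_000_000  # assumption: decimal megabytes


def raw_size_bytes(rows, ints_per_row=2):
    """Size of the "real" data only: rows * columns * 4 bytes."""
    return rows * ints_per_row * BYTES_PER_INT


def overhead_per_row(overhead_bytes, rows):
    """Average hashmap overhead attributed to each row."""
    return overhead_bytes / rows


def adjusted_estimate(rows, per_row_overhead, ints_per_row=2):
    """Raw data size plus an assumed fixed per-row overhead."""
    return raw_size_bytes(rows, ints_per_row) + rows * per_row_overhead


print(raw_size_bytes(ROWS) / MB)                    # 52.0
print(round(overhead_per_row(210 * MB, ROWS), 1))   # 32.3
print(adjusted_estimate(ROWS, 32) / MB)             # 260.0
```

With roughly 32 bytes of overhead per row, the adjusted figure lands on the observed ~260MB, which suggests that a per-row term is the missing piece of the estimate.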



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] hive pull request #369: HIVE-19723: Arrow serde: "Unsupported data type: Tim...

2018-06-07 Thread pudidic
GitHub user pudidic opened a pull request:

https://github.com/apache/hive/pull/369

HIVE-19723: Arrow serde: "Unsupported data type: Timestamp(NANOSECOND, 
null)"

This pull request adds a randomized unit test, supports microsecond precision 
for Spark integration, and changes TestJdbcWithMiniLlapArrow to test 
microseconds. The previous pull request was hard to merge because of conflicts 
caused by some reverts on the Apache master branch.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pudidic/hive HIVE-19723

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hive/pull/369.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #369


commit 1600141292aef67105a2df0435bd0bab52b1e4e3
Author: Teddy Choi 
Date:   2018-06-07T16:14:08Z

HIVE-19723: Arrow serde: "Unsupported data type: Timestamp(NANOSECOND, 
null)"




---


[GitHub] hive pull request #360: HIVE-19723: Arrow serde: "Unsupported data type: Tim...

2018-06-07 Thread pudidic
Github user pudidic closed the pull request at:

https://github.com/apache/hive/pull/360


---


Re: [DISCUSS] Release of standalone-metastore

2018-06-07 Thread Alan Gates
I have pushed the standalone metastore src and bin tarballs and their
signatures and hashes into Hive's dist area, so they should soon be
available for download.  Congrats to all who worked on this!

As part of creating a release tag for the standalone metastore I noticed we
didn't have one for release 3.0.0, so I created a tag for that as well.

Alan.

On Tue, Jun 5, 2018 at 10:45 AM Alan Gates wrote:

> I have put the binary and source objects up at
> https://home.apache.org/~gates/hive-standalone-metastore-3.0.0/ so
> everyone can take a look before I officially push them to dist.
>
> I don't think we need to vote on this as we have already officially
> released these objects, I'm just adding sha and gpg signatures for download
> purposes.  But, please take a look and make sure I did everything
> properly.  I'll push them to dist after a couple of days to give everyone a
> chance to look them over.
>
> Alan.
>
> On Wed, May 30, 2018 at 11:00 AM Vihang Karajgaonkar wrote:
>
>> The proposal to post the source and bin to the distribution sounds good to
>> me. We can do the testing and release standalone-metastore 3.1 like you
>> suggested above.
>>
>> On Tue, May 29, 2018 at 10:49 PM, Peter Vary wrote:
>>
>> > What do you think about adding a new profile, which makes it possible to
>> > compile the code with one command, until we separate the standalone
>> > metastore into a new project? Like -Pitests, but -Pmetastore. So
>> > "mvn clean install -Pmetastore,itests" will compile everything.
>> >
>> > Alan Gates wrote (on Wed, May 30, 2018, 0:42):
>> >
>> > > On Tue, May 29, 2018 at 3:29 PM Vihang Karajgaonkar
>> > > <vih...@cloudera.com> wrote:
>> > >
>> > > > How about cutting out a branch-3.0.1 and releasing 3.0.1 with the
>> > > > pom.xml fixed? My concern with the above approach is we haven't tested
>> > > > standalone-metastore when deployed independent of Hive.
>> > >
>> > > Actually, there is.  The tarballs for source and bin are already out
>> > > there.  If I post them on the distribution site then they'll be easier
>> > > to find.  So we can test that now.  And we can then do a 3.1 release of
>> > > the metastore whenever we want, as long as it's before a 3.1 release of
>> > > Hive.
>> > >
>> > > Alan.
>> > >
>> > >
>> > > > So we don't know if
>> > > > there is something fundamentally broken in that mode, and given that we
>> > > > don't know when 3.1 is going to be released it may remain in that state
>> > > > for a long time, which is not good. I think a good approach now would be
>> > > > to test 3.0 standalone-metastore and fix any issues along with the
>> > > > pom.xml changes and do a 3.0.1 release. What do you think?
>> > > >
>> > > > Thanks,
>> > > > Vihang
>> > > >
>> > > > On Tue, May 29, 2018 at 1:57 PM, Alan Gates wrote:
>> > > >
>> > > > > In the thread on releasing Hive 3.0 I wrote
>> > > > >
>> > > > > We should work on producing a standalone-metastore
>> > > > > release in the same time frame so that the schemas, etc. match. I can
>> > > > > RM that unless someone else wants to.
>> > > > >
>> > > > > https://lists.apache.org/thread.html/307b281c3742fdf6aeb7fac3ee74a98830400b67711755572de15b80@%3Cdev.hive.apache.org%3E
>> > > > >
>> > > > > My thinking was to produce a separate metastore release, like we do
>> > > > > for storage-api.  However, I missed that I needed to do some work in
>> > > > > branch-3.0 to disconnect standalone-metastore from the pom before the
>> > > > > release (in the same way that storage-api does).  Thus when we released
>> > > > > Hive 3.0 we also released the standalone-metastore. See
>> > > > > https://search.maven.org/#search%7Cga%7C2%7Cg%3A%22org.apache.hive%22
>> > > > > So I can't release another version of standalone-metastore 3.0.  Here
>> > > > > is what I propose we do:
>> > > > >
>> > > > >
>> > > > >1. Put the src and bin tarballs for standalone-metastore in Hive's
>> > > > >distribution site.  We have already voted on these as part of 3.0
>> > > > >release process.
>> > > > >2. Like storage-api, we keep the standalone-metastore linked in the pom
>> > > > >in the master branch.  This makes life easier for developers as they
>> > > > >produce new patches.
>> > > > >3. Also like storage-api, at some future point before we release
>> > > > >Hive 3.1 I will:
>> > > > >   1. Make a separate branch for standalone-metastore from branch-3
>> > > > >   2. Release a standalone-metastore 3.1 from this new branch
>> > > > >   3. Remove standalone-metastore from the list of sub-modules in
>> > > > >   Hive's pom.xml
>> > > > >   4. Make Hive depend on the released 3.1 version of the
>> > > > >   standalone-metastore.
>> > > > >4. For branch-3.0, I do not 

[jira] [Created] (HIVE-19825) HiveServer2 leader selection shall use different zookeeper znode

2018-06-07 Thread Daniel Dai (JIRA)
Daniel Dai created HIVE-19825:
-

 Summary: HiveServer2 leader selection shall use different 
zookeeper znode
 Key: HIVE-19825
 URL: https://issues.apache.org/jira/browse/HIVE-19825
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Reporter: Daniel Dai
Assignee: Daniel Dai


Currently, HiveServer2 leader selection (used only by privilegesynchronizer 
for now) reuses the /hiveserver2 parent znode, which is already used for 
HiveServer2 service discovery. This interferes with service discovery. I'd 
like to switch to a different znode, /hiveserver2-leader.





Cleaning up old version in dist

2018-06-07 Thread Alan Gates
Apache asks that we keep at most 2 current versions in dist, to minimize
the space we take up on distribution mirrors.  Since we are running
multiple lines and have a couple of separately releasable modules we'll
have more than 2 versions there.  But we have old versions of Hive 2 (2.1,
2.2) and of the storage-api (2.4, 2.5).  I think we should remove these.
That will leave us with the most up to date versions of Hive 1, 2, 3, the
storage api, and the standalone metastore.  Note that this does not affect
their availability in maven central or the apache archive.

Alan.


Re: Review Request 67263: HIVE-19602

2018-06-07 Thread Bharathkrishna Guruvayoor Murali via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67263/
---

(Updated June 7, 2018, 10:43 p.m.)


Review request for hive, Sahil Takiar and Vihang Karajgaonkar.


Changes
---

Making changes to be in sync with HIVE-19508 so that it does not cause merge 
conflicts with master


Bugs: HIVE-19602
https://issues.apache.org/jira/browse/HIVE-19602


Repository: hive-git


Description
---

Refactor inplace progress code in Hive-on-spark progress monitor to use 
ProgressMonitor instance


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkJobMonitor.java e78b1cd6637c46070378c25a372916817fe99a59
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkProgressMonitor.java PRE-CREATION


Diff: https://reviews.apache.org/r/67263/diff/5/

Changes: https://reviews.apache.org/r/67263/diff/4-5/


Testing
---


Thanks,

Bharathkrishna Guruvayoor Murali



Re: Cleaning up old version in dist

2018-06-07 Thread Thejas Nair
+1

On Thu, Jun 7, 2018 at 11:13 AM, Alan Gates wrote:
> Apache asks that we keep at most 2 current versions in dist, to minimize
> the space we take up on distribution mirrors.  Since we are running
> multiple lines and have a couple of separately releasable modules we'll
> have more than 2 versions there.  But we have old versions of Hive 2 (2.1,
> 2.2) and of the storage-api (2.4, 2.5).  I think we should remove these.
> That will leave us with the most up to date versions of Hive 1, 2, 3, the
> storage api, and the standalone metastore.  Note that this does not affect
> their availability in maven central or the apache archive.
>
> Alan.


[jira] [Created] (HIVE-19826) OrcRawRecordMerger doesn't work for more than one file

2018-06-07 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created HIVE-19826:
---

 Summary: OrcRawRecordMerger doesn't work for more than one file
 Key: HIVE-19826
 URL: https://issues.apache.org/jira/browse/HIVE-19826
 Project: Hive
  Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Sergey Shelukhin


Key object in the map is reused and reset, leading to bizarre merges and wrong 
results.
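
The failure mode described above can be shown with a generic example. This is not the OrcRawRecordMerger code; it is a minimal Python sketch, with hypothetical names, of what happens when a single mutable key object is reset and re-inserted into a hash-based map instead of being copied first.

```python
class ReusableKey:
    """Hypothetical mutable key: hash and equality follow the current value."""

    def __init__(self, value=0):
        self.value = value

    def set(self, value):
        self.value = value
        return self

    def copy(self):
        return ReusableKey(self.value)

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return isinstance(other, ReusableKey) and self.value == other.value


def load_reusing_key(values):
    """Buggy pattern: the same key object is reset and inserted every time."""
    key = ReusableKey()
    table = {}
    for v in values:
        table[key.set(v)] = f"file-{v}"
    return table


def load_copying_key(values):
    """Safer pattern: each entry gets its own immutable-in-practice key copy."""
    key = ReusableKey()
    table = {}
    for v in values:
        table[key.set(v).copy()] = f"file-{v}"
    return table


broken = load_reusing_key([1, 2])
print([k.value for k in broken])      # every stored key now reads as 2
print(ReusableKey(1) in broken)       # the entry for 1 can no longer be found
print(load_copying_key([1, 2])[ReusableKey(1)])
```

In the buggy variant the map still holds two entries, but both keys are the same object, so earlier entries become unreachable or appear under the wrong key. Copying the key before insertion is one common way to avoid this.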





Review Request 67497: HIVE-19794: Disable removing order by from subquery in GenericUDTFGetSplits

2018-06-07 Thread j . prasanth . j

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67497/
---

Review request for hive and Jason Dere.


Bugs: HIVE-19794
https://issues.apache.org/jira/browse/HIVE-19794


Repository: hive-git


Description
---

HIVE-19794: Disable removing order by from subquery in GenericUDTFGetSplits


Diffs
-

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java dd42fd127e633304a2da499afa60f7b051d329a9
  itests/hive-unit/src/test/java/org/apache/hive/jdbc/TestJdbcGenericUDTFGetSplits.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java 57f6c66a56a88bb7383ebe5832bba75240dea554
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDTFGetSplits.java 20d09611ccdf863d5a5e7dc811efe091f7b4aba2


Diff: https://reviews.apache.org/r/67497/diff/1/


Testing
---


Thanks,

Prasanth_J



Review Request 67484: HIVE-19782 Flash out TestObjectStore.testDirectSQLDropParitionsCleanup

2018-06-07 Thread Peter Vary via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67484/
---

Review request for hive, Alexander Kolbasov and Vihang Karajgaonkar.


Bugs: HIVE-19782
https://issues.apache.org/jira/browse/HIVE-19782


Repository: hive-git


Description
---

Updated test table/partition generation so we can insert into every related 
table, not just the basic ones.
Use this only when testing the table cleanup, so we save a little time on 
tests.
Fixed an existing bug in HiveObjectRefBuilder.java found by the tests.
Added the ability to add a PartitionColumnReference to HiveObjectRefBuilder.java.


Diffs
-

  
  standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/client/builder/HiveObjectRefBuilder.java 62a227a
  standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/TestObjectStore.java 7984af6


Diff: https://reviews.apache.org/r/67484/diff/1/


Testing
---

Ran the TestObjectStore.java tests


Thanks,

Peter Vary



Review Request 67485: HIVE-19783 Retrieve only locations in HiveMetaStore.dropPartitionsAndGetLocations

2018-06-07 Thread Peter Vary via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67485/
---

Review request for hive, Alexander Kolbasov and Vihang Karajgaonkar.


Bugs: HIVE-19783
https://issues.apache.org/jira/browse/HIVE-19783


Repository: hive-git


Description
---

Added a new getPartitionLocations method to the RawStore interface.

Implemented getPartitionLocations in ObjectStore using JDOQL.
Question: In CachedStore, shall I call rawStore.getPartitionLocations or 
reimplement it using getPartitions?

Modified dropPartitionsAndGetLocations:
- Instead of querying all data for every partition, query only the locations 
using the new interface method
- Removed the partKeys parameter, which became unnecessary


Diffs
-

  
  itests/hcatalog-unit/src/test/java/org/apache/hive/hcatalog/listener/DummyRawStoreFailEvent.java 0cc0ae5
  standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java d8b8414
  standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java b15d89d
  standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/RawStore.java 283798c
  standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/cache/CachedStore.java 9da8d72
  standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/DummyRawStoreControlledCommit.java 0461c4e
  standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java b71eda4


Diff: https://reviews.apache.org/r/67485/diff/1/


Testing
---

Ran the TestTablesCreateDropAlterTruncate test (partitioned table creation and 
drop)


Thanks,

Peter Vary



Re: Review Request 67485: HIVE-19783 Retrieve only locations in HiveMetaStore.dropPartitionsAndGetLocations

2018-06-07 Thread Peter Vary via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67485/
---

(Updated June 7, 2018, 10:31 a.m.)


Review request for hive, Alexander Kolbasov and Vihang Karajgaonkar.


Changes
---

Null locations are possible. Handle those as well


Bugs: HIVE-19783
https://issues.apache.org/jira/browse/HIVE-19783


Repository: hive-git


Description
---

Added a new getPartitionLocations method to the RawStore interface.

Implemented getPartitionLocations in ObjectStore using JDOQL.
Question: In CachedStore, shall I call rawStore.getPartitionLocations or 
reimplement it using getPartitions?

Modified dropPartitionsAndGetLocations:
- Instead of querying all data for every partition, query only the locations 
using the new interface method
- Removed the partKeys parameter, which became unnecessary


Diffs (updated)
-

  
  itests/hcatalog-unit/src/test/java/org/apache/hive/hcatalog/listener/DummyRawStoreFailEvent.java ff97522
  standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java b9f5fb8
  standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java b3a8dd0
  standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/RawStore.java f350aa9
  standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/cache/CachedStore.java d9356b8
  standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/DummyRawStoreControlledCommit.java 8c3ada3
  standalone-metastore/src/test/java/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java f98e8de


Diff: https://reviews.apache.org/r/67485/diff/2/

Changes: https://reviews.apache.org/r/67485/diff/1-2/


Testing
---

Ran the TestTablesCreateDropAlterTruncate test (partitioned table creation and 
drop)


Thanks,

Peter Vary



[jira] [Created] (HIVE-19823) BytesBytesMultiHashMap estimation should account for load size

2018-06-07 Thread Zoltan Haindrich (JIRA)
Zoltan Haindrich created HIVE-19823:
---

 Summary: BytesBytesMultiHashMap estimation should account for load 
size
 Key: HIVE-19823
 URL: https://issues.apache.org/jira/browse/HIVE-19823
 Project: Hive
  Issue Type: Improvement
Reporter: Zoltan Haindrich
Assignee: Zoltan Haindrich


It can happen that the capacity is known beforehand and the estimated size 
of the hashtable is accurate, but a rehash still occurs because after some 
time the element count violates the load factor ratio.

By default this could happen with a {{1-loadfactor = 25%}} probability.

https://github.com/apache/hive/blob/cfd57348c1ac188e0ba131d5636a62ff7b7c27be/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/BytesBytesMultiHashMap.java#L176-L187
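
The effect can be illustrated with a small model of a load-factor-bound hash table. This is a sketch under stated assumptions, not the BytesBytesMultiHashMap implementation: the function names are hypothetical, the 0.75 default is taken from the 25% figure above, and power-of-two capacities are an assumption made for the example.

```python
import math


def rehash_threshold(capacity, load_factor=0.75):
    """Largest element count the table holds before it must grow."""
    return int(capacity * load_factor)


def will_rehash(capacity, expected_keys, load_factor=0.75):
    """True if loading expected_keys into a table of this capacity forces a rehash."""
    return expected_keys > rehash_threshold(capacity, load_factor)


def capacity_for(expected_keys, load_factor=0.75):
    """Smallest power-of-two capacity that fits expected_keys without a rehash.

    Assumption: capacities are powers of two, as in many open-addressing tables.
    """
    needed = math.ceil(expected_keys / load_factor)
    capacity = 1
    while capacity < needed:
        capacity <<= 1
    return capacity


print(will_rehash(1024, 700))   # False: 700 <= 768
print(will_rehash(1024, 800))   # True: 800 > 768, although 800 < 1024
print(capacity_for(800))        # 2048
```

One reading of the 25% figure: if the final key count is equally likely to fall anywhere up to the chosen capacity, the top `1 - loadfactor` share of that range triggers a rehash. A size estimate that only looks at the initial capacity misses the memory used by the grown table in those cases.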


