[GitHub] hive pull request #511: HIVE-21078: Replicate column and table level statist...

2019-01-02 Thread ashutosh-bapat
GitHub user ashutosh-bapat opened a pull request:

https://github.com/apache/hive/pull/511

HIVE-21078: Replicate column and table level statistics for unpartitioned 
Hive tables

@maheshk114, @sankarh can you please review?



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ashutosh-bapat/hive hive21078

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hive/pull/511.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #511


commit db98502a44f69f255924231b03e2145248c9be0f
Author: Ashutosh Bapat 
Date:   2018-12-19T04:49:29Z

HIVE-21078: Replicate column and table level statistics for unpartitioned 
Hive tables

Column statistics are included as part of the Table object during
bootstrap dump and loaded when the corresponding table is created on
the replica.

During incremental dump and load, the UpdateTableColStats event is used
to replicate the statistics.

In both cases, the statistics are replicated only when the data is
replicated.

Ashutosh Bapat




---


Re: insert data into hadoop / hive cluster

2019-01-02 Thread Daniel Takacs
Thanks, the tools you pointed to were very interesting, but I was hoping to
achieve this with very few external dependencies.

I was thinking of running a script, what do you think?

CREATE TABLE IF NOT EXISTS dbname.finaltable(a string);
SET hive.cli.errors.ignore=true;
ALTER TABLE dbname.finaltable ADD COLUMNS (b decimal(38,0));
ALTER TABLE dbname.finaltable ADD COLUMNS (c decimal(38,0));
ALTER TABLE dbname.finaltable ADD COLUMNS (d string);
SET hive.cli.errors.ignore=false;

CREATE TABLE dbname.527b66e52b534d919581cae3476b8469(a decimal(38,0),
b decimal(38,0),c string,d string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\,'
LINES TERMINATED BY '\n' tblproperties("skip.header.line.count"="1");

LOAD DATA LOCAL INPATH
'/home/user/upload_tmp/73a06dcfa9d74bf8a87a9e297e623521/datatoimport.csv'
OVERWRITE INTO TABLE dbname.527b66e52b534d919581cae3476b8469;

set mapreduce.job.queuename=myqueue;

-- UNFORTUNATELY NEXT STATEMENT TAKES A BIT OF TIME SINCE I GET PLACED ON A QUEUE
INSERT INTO TABLE dbname.finaltable(a,b,c,d) select a,b,c,d FROM
dbname.527b66e52b534d919581cae3476b8469;

DROP TABLE dbname.527b66e52b534d919581cae3476b8469;
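
One way to avoid relying on hive.cli.errors.ignore to swallow "column
already exists" failures is to work out which columns are missing before
issuing any ALTER TABLE. Below is a minimal Java sketch of that idea. It
assumes the current column names have already been fetched somehow (for
example by parsing DESCRIBE output; that step is not shown) and that every
new column gets the same type. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SchemaDrift {
    /** Returns the CSV header columns not yet present in the table, in header order. */
    public static List<String> missingColumns(String csvHeaderLine, Set<String> existingColumns) {
        // Hive column names are case-insensitive, so compare in lower case.
        Set<String> existingLower = new HashSet<>();
        for (String c : existingColumns) {
            existingLower.add(c.trim().toLowerCase());
        }
        List<String> missing = new ArrayList<>();
        for (String raw : csvHeaderLine.split(",")) {
            String name = raw.trim().toLowerCase();
            if (!name.isEmpty() && !existingLower.contains(name) && !missing.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    /** Builds one ALTER TABLE statement for a single new column. */
    public static String alterStatement(String table, String column, String hiveType) {
        return "ALTER TABLE " + table + " ADD COLUMNS (`" + column + "` " + hiveType + ");";
    }

    public static void main(String[] args) {
        // Assumed inputs: the table currently has column "a"; the new CSV header is "a,b,c,d".
        Set<String> existing = new HashSet<>(Arrays.asList("a"));
        for (String col : missingColumns("a,b,c,d", existing)) {
            System.out.println(alterStatement("dbname.finaltable", col, "decimal(38,0)"));
        }
    }
}
```

The generated statements could then be written to the .hql file ahead of the
LOAD DATA step, so that a failing ALTER TABLE is a real error rather than
something to ignore.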



From: dam6923 
Sent: Thursday, December 27, 2018 5:08 AM
To: dev@hive.apache.org
Cc: u...@hive.apache.org
Subject: Re: insert data into hadoop / hive cluster

Check out an ETL tool such as StreamSets, NiFi, Pentaho.

On Wed, Dec 26, 2018, 11:55 PM Daniel Takacs wrote:
> I'm working on an ETL that requires me to import a continuous stream of
> CSVs into hadoop / hive cluster. For now let's assume the CSVs need to end
> up in the same database.table. But the newer CSVs might introduce
> additional columns (hence I want the script to alter the table and add
> additional columns as it encounters them).
>
>
>
> e.g.
>
>
>
> csv1.csv
>
> a,b
>
> 1,2
>
> 2,4
>
>
>
> csv2.csv
>
> a,b,c
>
> 3,8,0
>
> 4,10,2
>
>
>
> what is the best way to write such ETL into hive.  should I use hive with
> -f to spin up scripts like:
>
>
> upsert.hql:
>
> CREATE TABLE IF NOT EXISTS mydbname.testtable(a INT) ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\,';
>
> SET hive.cli.errors.ignore=true;
>
> ALTER TABLE mydbname.testtable ADD COLUMNS (b string);
>
> SET hive.cli.errors.ignore=false;
>
> LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv' INTO TABLE
> mydbname.testtable;
>
>
>
> (disadvantage is that when LOAD DATA encounters an invalid string for an
> integer column, the value NULL is inserted and I do not get notified)
>
> should I do it from beeline?
>
> should I write a pig script?
>
> should I write a java program?
>
>
> should I use programs like:
> https://github.com/enahwe/Csv2Hive
>
>
> what's the recommended approach here?
>
>


Re: [jira] [Commented] (HIVE-18884) Simplify Logging in Hive Metastore Client

2019-01-02 Thread Mani M
Hi Peter
Thanks for the info.
I have tried with patches 03, 04 and 05. The test cases are failing in
classes other than the one I changed.
How can we avoid this? Any suggestions?
With Regards
M.Mani
+61 432 461 087


On Wed, 2 Jan 2019, 02:19 Peter Vary (JIRA) wrote:
> [
> https://issues.apache.org/jira/browse/HIVE-18884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731615#comment-16731615
> ]
>
> Peter Vary commented on HIVE-18884:
> ---
>
> [~rmsm...@gmail.com]: Try a few times if you think they are not related.
> We have too many flaky tests :(
>
> Thanks and welcome to Hive :)
> Peter
>
> > Simplify Logging in Hive Metastore Client
> > -
> >
> > Key: HIVE-18884
> > URL: https://issues.apache.org/jira/browse/HIVE-18884
> > Project: Hive
> > Issue Type: Improvement
> > Components: Standalone Metastore
> > Affects Versions: 3.0.0
> > Reporter: BELUGA BEHR
> > Assignee: Mani M
> > Priority: Minor
> > Labels: noob
> > Attachments: HIVE.18884.02.patch, HIVE.18884.03.patch,
> > HIVE.18884.04.patch, HIVE.18884.05.patch, HIVE.18884.patch
> >
> >
> >
> > https://github.com/apache/hive/blob/4047befe48c8f762c58d8854e058385c1df151c6/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
> > The current logging is:
> > {code}
> > 2018-02-26 07:02:44,883  INFO  hive.metastore: [HiveServer2-Handler-Pool: Thread-65]: Trying to connect to metastore with URI thrift://host.company.com:9083
> > 2018-02-26 07:02:44,892  INFO  hive.metastore: [HiveServer2-Handler-Pool: Thread-65]: Connected to metastore.
> > 2018-02-26 07:02:44,892  INFO  hive.metastore: [HiveServer2-Handler-Pool: Thread-65]: Opened a connection to metastore, current connections: 2
> > {code}
> > Please simplify to something like:
> > {code}
> > 2018-02-26 07:02:44,892  INFO  hive.metastore: [HiveServer2-Handler-Pool: Thread-65]: Opened a connection to the Metastore Server (URI thrift://host.company.com:9083), current connections: 2
> > ... or ...
> > 2018-02-26 07:02:44,892  ERROR  hive.metastore: [HiveServer2-Handler-Pool: Thread-65]: Failed to connect to the Metastore Server (URI thrift://host.company.com:9083)
> > {code}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>


Re: Review Request 69642: HIVE-20977: Lazy evaluate the table object in PreReadTableEvent to improve get_partition performance

2019-01-02 Thread Karthik Manamcheri via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69642/#review211625
---




standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/client/TestGetPartitions.java
Lines 374 (patched)


The behavior of the getPartitions\* calls changed to be more in line with how
the other getTable/getDatabase calls work.

Before this change, if you issued a getPartitionsByNames call with an empty
database, we threw an exception. After this change, we return an empty list
of partitions instead. This is similar to what happens if you issue a
getTablesByNames call (an empty list of tables is returned).


- Karthik Manamcheri


On Jan. 3, 2019, 1:40 a.m., Karthik Manamcheri wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69642/
> ---
> 
> (Updated Jan. 3, 2019, 1:40 a.m.)
> 
> 
> Review request for hive, Adam Holley, Na Li, Morio Ramdenbourg, Naveen 
> Gangam, Peter Vary, Sergio Pena, and Vihang Karajgaonkar.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-20977: Lazy evaluate the table object in PreReadTableEvent to improve 
> get_partition performance
> 
> 
> Diffs
> -
> 
>   
> standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
>  a9398ae1e7 
>   
> standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/events/PreReadTableEvent.java
>  beec72bc12 
>   
> standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/utils/ThrowingSupplier.java
>  PRE-CREATION 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java
>  7429d18226 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestMetaStoreEventListener.java
>  fe64a91b56 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/client/TestGetPartitions.java
>  4d7f7c1220 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/client/TestListPartitions.java
>  a338bd4032 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/events/TestPreReadTableEvent.java
>  PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/69642/diff/4/
> 
> 
> Testing
> ---
> 
> Unit tests.
> Manual performance test with Cloudera BDR to notice improved backup 
> performance.
> 
> 
> Thanks,
> 
> Karthik Manamcheri
> 
>



Re: Review Request 69642: HIVE-20977: Lazy evaluate the table object in PreReadTableEvent to improve get_partition performance

2019-01-02 Thread Karthik Manamcheri via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69642/
---

(Updated Jan. 3, 2019, 1:40 a.m.)


Review request for hive, Adam Holley, Na Li, Morio Ramdenbourg, Naveen Gangam, 
Peter Vary, Sergio Pena, and Vihang Karajgaonkar.


Repository: hive-git


Description
---

HIVE-20977: Lazy evaluate the table object in PreReadTableEvent to improve 
get_partition performance


Diffs (updated)
-

  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
 a9398ae1e7 
  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/events/PreReadTableEvent.java
 beec72bc12 
  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/utils/ThrowingSupplier.java
 PRE-CREATION 
  
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java
 7429d18226 
  
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestMetaStoreEventListener.java
 fe64a91b56 
  
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/client/TestGetPartitions.java
 4d7f7c1220 
  
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/client/TestListPartitions.java
 a338bd4032 
  
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/events/TestPreReadTableEvent.java
 PRE-CREATION 


Diff: https://reviews.apache.org/r/69642/diff/4/

Changes: https://reviews.apache.org/r/69642/diff/3-4/


Testing
---

Unit tests.
Manual performance test with Cloudera BDR to notice improved backup performance.


Thanks,

Karthik Manamcheri



Re: Review Request 69642: HIVE-20977: Lazy evaluate the table object in PreReadTableEvent to improve get_partition performance

2019-01-02 Thread Karthik Manamcheri via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69642/
---

(Updated Jan. 2, 2019, 8:32 p.m.)


Review request for hive, Adam Holley, Na Li, Morio Ramdenbourg, Naveen Gangam, 
Peter Vary, Sergio Pena, and Vihang Karajgaonkar.


Repository: hive-git


Description
---

HIVE-20977: Lazy evaluate the table object in PreReadTableEvent to improve 
get_partition performance


Diffs (updated)
-

  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
 a9398ae1e7 
  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/events/PreReadTableEvent.java
 beec72bc12 
  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/utils/ThrowingSupplier.java
 PRE-CREATION 
  
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestMetaStoreEventListener.java
 fe64a91b56 
  
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/events/TestPreReadTableEvent.java
 PRE-CREATION 


Diff: https://reviews.apache.org/r/69642/diff/3/

Changes: https://reviews.apache.org/r/69642/diff/2-3/


Testing
---

Unit tests.
Manual performance test with Cloudera BDR to notice improved backup performance.


Thanks,

Karthik Manamcheri



Re: Review Request 69642: HIVE-20977: Lazy evaluate the table object in PreReadTableEvent to improve get_partition performance

2019-01-02 Thread Karthik Manamcheri via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69642/#review211612
---




standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
Line 4551 (original), 4554 (patched)


Changed it to NoSuchObjectException. I don't recall why I changed it to a 
MetaException, but it doesn't need to change.

Regarding the API change, I thought about it more and changed the code so
that it does not actually change any semantics. With the new change, we'll
throw a NoSuchObjectException here, which gets wrapped in a RuntimeException.
This happens when the pre-event listener is fired. At that point, we can
catch the RuntimeException and rethrow the NoSuchObjectException (as the code
did before this change).
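
The wrap-and-rethrow idea described above can be sketched roughly as follows.
This is a standalone illustration, not the code in the patch: the
NoSuchObject exception, the ThrowingSupplier interface shown here, and the
lazy/getOrRethrow helpers are all hypothetical stand-ins, and the actual
Hive classes may differ in shape.

```java
import java.util.function.Supplier;

class LazyTableSketch {
    /** Hypothetical stand-in for a checked "object not found" exception. */
    static class NoSuchObject extends Exception {
        NoSuchObject(String msg) { super(msg); }
    }

    /** A supplier whose get() is allowed to throw a checked exception. */
    @FunctionalInterface
    interface ThrowingSupplier<T> {
        T get() throws Exception;
    }

    /**
     * Defers the load until the first get() and caches the result.
     * A checked failure is wrapped in a RuntimeException because
     * java.util.function.Supplier cannot declare checked exceptions.
     */
    public static <T> Supplier<T> lazy(ThrowingSupplier<T> loader) {
        return new Supplier<T>() {
            private T value;
            private boolean loaded;

            @Override
            public synchronized T get() {
                if (!loaded) {
                    try {
                        value = loader.get();
                    } catch (RuntimeException e) {
                        throw e;
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                    loaded = true;
                }
                return value;
            }
        };
    }

    /** Caller side: unwrap the RuntimeException and rethrow the original checked exception. */
    public static <T> T getOrRethrow(Supplier<T> supplier) throws NoSuchObject {
        try {
            return supplier.get();
        } catch (RuntimeException e) {
            if (e.getCause() instanceof NoSuchObject) {
                throw (NoSuchObject) e.getCause();
            }
            throw e;
        }
    }
}
```

With this shape, a listener that never asks for the table never pays for the
lookup, while callers that do ask still see the same checked exception type
as before.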


- Karthik Manamcheri


On Jan. 2, 2019, 8:15 p.m., Karthik Manamcheri wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69642/
> ---
> 
> (Updated Jan. 2, 2019, 8:15 p.m.)
> 
> 
> Review request for hive, Adam Holley, Na Li, Morio Ramdenbourg, Naveen 
> Gangam, Peter Vary, Sergio Pena, and Vihang Karajgaonkar.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-20977: Lazy evaluate the table object in PreReadTableEvent to improve 
> get_partition performance
> 
> 
> Diffs
> -
> 
>   
> standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
>  a9398ae1e7 
>   
> standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/events/PreReadTableEvent.java
>  beec72bc12 
>   
> standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/utils/ThrowingSupplier.java
>  PRE-CREATION 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestMetaStoreEventListener.java
>  fe64a91b56 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/events/TestPreReadTableEvent.java
>  PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/69642/diff/2/
> 
> 
> Testing
> ---
> 
> Unit tests.
> Manual performance test with Cloudera BDR to notice improved backup 
> performance.
> 
> 
> Thanks,
> 
> Karthik Manamcheri
> 
>



Re: Review Request 69642: HIVE-20977: Lazy evaluate the table object in PreReadTableEvent to improve get_partition performance

2019-01-02 Thread Karthik Manamcheri via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69642/
---

(Updated Jan. 2, 2019, 8:15 p.m.)


Review request for hive, Adam Holley, Na Li, Morio Ramdenbourg, Naveen Gangam, 
Peter Vary, Sergio Pena, and Vihang Karajgaonkar.


Repository: hive-git


Description
---

HIVE-20977: Lazy evaluate the table object in PreReadTableEvent to improve 
get_partition performance


Diffs
-

  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
 a9398ae1e7 
  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/events/PreReadTableEvent.java
 beec72bc12 
  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/utils/ThrowingSupplier.java
 PRE-CREATION 
  
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestMetaStoreEventListener.java
 fe64a91b56 
  
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/events/TestPreReadTableEvent.java
 PRE-CREATION 


Diff: https://reviews.apache.org/r/69642/diff/2/


Testing
---

Unit tests.
Manual performance test with Cloudera BDR to notice improved backup performance.


Thanks,

Karthik Manamcheri



[jira] [Created] (HIVE-21081) DATE_FORMAT incorrectly returns results on the last week of the calendar year

2019-01-02 Thread Wilson Lu (JIRA)
Wilson Lu created HIVE-21081:


 Summary: DATE_FORMAT incorrectly returns results on the last week 
of the calendar year
 Key: HIVE-21081
 URL: https://issues.apache.org/jira/browse/HIVE-21081
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 2.3.2, 2.1.1, 2.3.3
Reporter: Wilson Lu


Hive's DATE_FORMAT function returns incorrect results for dates in the last
week of the calendar year. The following statements format the date incorrectly:

select DATE_FORMAT('2017-12-31', 'MM')

select DATE_FORMAT('2018-12-30', 'MM')
select DATE_FORMAT('2018-12-31', 'MM')

select DATE_FORMAT('2019-12-29', 'MM')
select DATE_FORMAT('2019-12-30', 'MM')
select DATE_FORMAT('2019-12-31', 'MM')
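
A possible explanation, offered as a guess and not confirmed in this report:
the dates listed above are exactly the days whose week-based year differs
from the calendar year under US week rules (weeks start on Sunday and the
week containing January 1 counts as week 1). In Java date patterns the
letter Y means week year while y means calendar year, so a pattern
containing YYYY would roll over early on these dates. The patterns as quoted
above contain only MM, so the sketch below assumes the original pattern also
included a year field. The class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

class WeekYearDemo {
    /** Formats an ISO date (yyyy-MM-dd) with the given SimpleDateFormat pattern, US locale, UTC. */
    public static String format(String isoDate, String pattern) {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
        in.setTimeZone(TimeZone.getTimeZone("UTC"));
        SimpleDateFormat out = new SimpleDateFormat(pattern, Locale.US);
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return out.format(in.parse(isoDate));
        } catch (ParseException e) {
            throw new IllegalArgumentException("not an ISO date: " + isoDate, e);
        }
    }

    public static void main(String[] args) {
        String[] dates = {"2017-12-31", "2018-12-29", "2018-12-30", "2018-12-31"};
        for (String d : dates) {
            // Column 2 is the calendar year (yyyy), column 3 the week year (YYYY).
            System.out.println(d + "  " + format(d, "yyyy") + "  " + format(d, "YYYY"));
        }
    }
}
```

If this is the cause, using lower-case yyyy in the format string would give
the calendar year on these dates.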

 

 





[jira] [Created] (HIVE-21080) Update Hive to use ORC-1.5.4

2019-01-02 Thread Vaibhav Gumashta (JIRA)
Vaibhav Gumashta created HIVE-21080:
---

 Summary: Update Hive to use ORC-1.5.4
 Key: HIVE-21080
 URL: https://issues.apache.org/jira/browse/HIVE-21080
 Project: Hive
  Issue Type: Bug
  Components: ORC
Reporter: Vaibhav Gumashta


Now that ORC-1.5.4 is released, we should update Hive's version of ORC so that 
HIVE-20699 can use it





[jira] [Created] (HIVE-21079) Replicate column statistics for partitions of partitioned Hive table.

2019-01-02 Thread Ashutosh Bapat (JIRA)
Ashutosh Bapat created HIVE-21079:
-

 Summary: Replicate column statistics for partitions of partitioned 
Hive table.
 Key: HIVE-21079
 URL: https://issues.apache.org/jira/browse/HIVE-21079
 Project: Hive
  Issue Type: Sub-task
Reporter: Ashutosh Bapat
Assignee: Ashutosh Bapat


This task is for replicating statistics for partitions of a partitioned Hive 
table.





[jira] [Created] (HIVE-21078) Replicate table level column statistics for Hive tables

2019-01-02 Thread Ashutosh Bapat (JIRA)
Ashutosh Bapat created HIVE-21078:
-

 Summary: Replicate table level column statistics for Hive tables
 Key: HIVE-21078
 URL: https://issues.apache.org/jira/browse/HIVE-21078
 Project: Hive
  Issue Type: Sub-task
Reporter: Ashutosh Bapat
Assignee: Ashutosh Bapat


This task is for replicating table level statistics. Partition level statistics 
will be worked upon in a separate sub-task.


