Re: insert data into hadoop / hive cluster

2019-01-02 Thread Daniel Takacs
Thanks, the tools you pointed to were very interesting, but I was hoping to
achieve this with very few external dependencies.

I was thinking of running a script like the following; what do you think?

CREATE TABLE IF NOT EXISTS dbname.finaltable(a string);
SET hive.cli.errors.ignore=true;
ALTER TABLE dbname.finaltable ADD COLUMNS (b decimal(38,0));
ALTER TABLE dbname.finaltable ADD COLUMNS (c decimal(38,0));
ALTER TABLE dbname.finaltable ADD COLUMNS (d string);
SET hive.cli.errors.ignore=false;

CREATE TABLE dbname.527b66e52b534d919581cae3476b8469(a decimal(38,0),b
decimal(38,0),c string,d string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' tblproperties("skip.header.line.count"="1");

LOAD DATA LOCAL INPATH
'/home/user/upload_tmp/73a06dcfa9d74bf8a87a9e297e623521/datatoimport.csv'
OVERWRITE INTO TABLE dbname.527b66e52b534d919581cae3476b8469;

set mapreduce.job.queuename=myqueue;

-- Unfortunately the next statement takes a while since the job gets placed on a queue
INSERT INTO TABLE dbname.finaltable(a,b,c,d) select a,b,c,d FROM 
dbname.527b66e52b534d919581cae3476b8469;

DROP TABLE dbname.527b66e52b534d919581cae3476b8469;



From: dam6923 
Sent: Thursday, December 27, 2018 5:08 AM
To: dev@hive.apache.org
Cc: u...@hive.apache.org
Subject: Re: insert data into hadoop / hive cluster

Check out an ETL tool such as StreamSets, NiFi, Pentaho.

On Wed, Dec 26, 2018, 11:55 PM Daniel Takacs wrote:
> I'm working on an ETL that requires me to import a continuous stream of
> CSVs into hadoop / hive cluster. For now let's assume the CSVs need to end
> up in the same database.table. But the newer CSVs might introduce
> additional columns (hence I want the script to alter the table and add
> additional columns as it encounters them).
>
>
>
> e.g.
>
>
>
> csv1.csv
>
> a,b
>
> 1,2
>
> 2,4
>
>
>
> csv2.csv
>
> a,b,c
>
> 3,8,0
>
> 4,10,2
>
>
>
> what is the best way to write such an ETL into Hive? Should I use hive with
> -f to spin up scripts like:
>
>
> upsert.hql:
>
> CREATE TABLE IF NOT EXISTS mydbname.testtable(a INT) ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\,';
>
> SET hive.cli.errors.ignore=true;
>
> ALTER TABLE mydbname.testtable ADD COLUMNS (b string);
>
> SET hive.cli.errors.ignore=false;
>
> LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv' INTO TABLE
> mydbname.testtable;
>
>
>
> (a disadvantage is that when LOAD DATA encounters an invalid string for an
> integer column, NULL is inserted and I do not get notified)
>
> should I do it from beeline?
>
> should I write a pig script?
>
> should I write a java program?
>
>
> should I use programs like: https://github.com/enahwe/Csv2Hive
>
>
> what's the recommended approach here?
>
>

