Thanks, the tools you pointed to were very interesting, but I was hoping to achieve this with very few external dependencies.
I was thinking of running a script, what do you think?

CREATE TABLE IF NOT EXISTS dbname.finaltable(a string);

SET hive.cli.errors.ignore=true;
ALTER TABLE dbname.finaltable ADD COLUMNS (b decimal(38,0));
ALTER TABLE dbname.finaltable ADD COLUMNS (c decimal(38,0));
ALTER TABLE dbname.finaltable ADD COLUMNS (d string);
SET hive.cli.errors.ignore=false;

CREATE TABLE dbname.`527b66e52b534d919581cae3476b8469`(a decimal(38,0), b decimal(38,0), c string, d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="1");

LOAD DATA LOCAL INPATH '/home/user/upload_tmp/73a06dcfa9d74bf8a87a9e297e623521/datatoimport.csv' OVERWRITE INTO TABLE dbname.`527b66e52b534d919581cae3476b8469`;

SET mapreduce.job.queuename=myqueue;

-- UNFORTUNATELY THE NEXT STATEMENT TAKES A BIT OF TIME SINCE I GET PLACED ON A QUEUE
INSERT INTO TABLE dbname.finaltable(a,b,c,d) SELECT a,b,c,d FROM dbname.`527b66e52b534d919581cae3476b8469`;

DROP TABLE dbname.`527b66e52b534d919581cae3476b8469`;

(Note: the temp table name starts with a digit, so it is backquoted; the queue comment uses `--`, since `#` is not a HiveQL comment.)

________________________________
From: dam6923 <dam6...@gmail.com>
Sent: Thursday, December 27, 2018 5:08 AM
To: dev@hive.apache.org
Cc: u...@hive.apache.org
Subject: Re: insert data into hadoop / hive cluster

Check out an ETL tool such as StreamSets, NiFi, Pentaho.

On Wed, Dec 26, 2018, 11:55 PM Daniel Takacs <hungarian...@hotmail.com> wrote:

> I'm working on an ETL that requires me to import a continuous stream of
> CSVs into a hadoop / hive cluster. For now let's assume the CSVs need to end
> up in the same database.table. But the newer CSVs might introduce
> additional columns (hence I want the script to alter the table and add
> additional columns as it encounters them).
>
> e.g.
>
> csv1.csv
> a,b
> 1,2
> 2,4
>
> csv2.csv
> a,b,c
> 3,8,0
> 4,10,2
>
> What is the best way to write such an ETL into Hive?
> Should I use hive with -f to spin up scripts like:
>
> upsert.hql:
>
> CREATE TABLE IF NOT EXISTS mydbname.testtable(a INT) ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ',';
>
> SET hive.cli.errors.ignore=true;
> ALTER TABLE mydbname.testtable ADD COLUMNS (b string);
> SET hive.cli.errors.ignore=false;
>
> LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv' INTO TABLE
> mydbname.testtable;
>
> (disadvantage is that when LOAD DATA encounters an invalid column string
> for an integer field the value NULL is inserted and I do not get notified)
>
> Should I do it from beeline?
> Should I write a pig script?
> Should I write a java program?
>
> Should I use programs like:
> https://github.com/enahwe/Csv2Hive
>
> What's the recommended approach here?
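
P.S. To avoid hardcoding one ALTER TABLE per expected column, I could also generate the statements from each CSV header. A minimal sketch, assuming Python is available on the edge node; `alter_statements` is a hypothetical helper (not an existing tool), and `dbname.finaltable` is just the table from my script above:

```python
import csv
import io

def alter_statements(csv_text, existing_cols, table="dbname.finaltable",
                     default_type="string"):
    """Return one ALTER TABLE ... ADD COLUMNS statement per header
    column that the target table does not have yet."""
    # First row of the CSV is the header (skip.header.line.count=1).
    header = next(csv.reader(io.StringIO(csv_text)))
    new_cols = [c for c in header if c not in existing_cols]
    return [
        f"ALTER TABLE {table} ADD COLUMNS ({c} {default_type});"
        for c in new_cols
    ]

# csv2.csv from the original question introduces column c:
stmts = alter_statements("a,b,c\n3,8,0\n4,10,2\n", existing_cols={"a", "b"})
print(stmts)  # ['ALTER TABLE dbname.finaltable ADD COLUMNS (c string);']
```

The emitted statements could then be passed to `hive -f` / `beeline -f` before the LOAD DATA step, which would also drop the need for hive.cli.errors.ignore, since only genuinely new columns get an ALTER.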