Check out an ETL tool such as StreamSets, NiFi, or Pentaho.

On Wed, Dec 26, 2018, 11:55 PM Daniel Takacs <hungarian...@hotmail.com> wrote:
> I'm working on an ETL that requires me to import a continuous stream of
> CSVs into a Hadoop/Hive cluster. For now let's assume the CSVs need to end
> up in the same database.table, but newer CSVs might introduce additional
> columns (hence I want the script to alter the table and add those columns
> as it encounters them).
>
> e.g.
>
> csv1.csv
> a,b
> 1,2
> 2,4
>
> csv2.csv
> a,b,c
> 3,8,0
> 4,10,2
>
> What is the best way to write such an ETL for Hive? Should I use hive -f
> to run scripts like:
>
> upsert.hql:
>
> CREATE TABLE IF NOT EXISTS mydbname.testtable (a INT) ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ',';
> SET hive.cli.errors.ignore=true;
> ALTER TABLE mydbname.testtable ADD COLUMNS (b STRING);
> SET hive.cli.errors=false;
> LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv' INTO TABLE
> mydbname.testtable;
>
> (A disadvantage is that when LOAD DATA encounters an invalid string for an
> integer column, the value becomes NULL and I am not notified.)
>
> Should I do it from beeline?
> Should I write a Pig script?
> Should I write a Java program?
> Should I use programs like https://github.com/enahwe/Csv2Hive ?
>
> What's the recommended approach here?
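For what it's worth, if you do stay with plain beeline scripting, a minimal sketch of the header-diffing idea is below. Everything here is an assumption for illustration: the JDBC URL, the db/table names, and typing every new column as STRING.

#!/usr/bin/env bash
# Sketch: add any columns found in the CSV header that the Hive table
# does not yet have, then load the file. Assumes beeline is on the PATH,
# HiveServer2 is at the URL below, and new columns can all be STRING.
set -euo pipefail

CSV="$1"                               # e.g. csv2.csv, header in line 1
JDBC="jdbc:hive2://localhost:10000"    # hypothetical HS2 endpoint
TABLE="mydbname.testtable"

# Columns currently on the table (first field of DESCRIBE output).
existing=$(beeline -u "$JDBC" --silent=true --showHeader=false \
  --outputformat=tsv2 -e "DESCRIBE $TABLE" | awk '{print $1}')

# Column names from the CSV header line, one per line.
header=$(head -n 1 "$CSV" | tr ',' '\n')

# Issue one ALTER TABLE per genuinely missing column, so there is no
# need for the hive.cli.errors.ignore on/off dance.
for col in $header; do
  if ! grep -qx "$col" <<< "$existing"; then
    beeline -u "$JDBC" -e "ALTER TABLE $TABLE ADD COLUMNS ($col STRING)"
  fi
done

# Load the file; note LOAD DATA just moves the file and validates nothing.
beeline -u "$JDBC" -e "LOAD DATA LOCAL INPATH '$CSV' INTO TABLE $TABLE"

Two caveats: since your sample CSVs carry a header row, create the table with TBLPROPERTIES ('skip.header.line.count'='1') so the header is not loaded as data, and this still inherits LOAD DATA's silent-NULL behavior for bad values, so any type validation has to happen before the load (or keep columns as STRING and cast in a view).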