Check out an ETL tool such as StreamSets, NiFi, or Pentaho.
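
If you do roll your own script, the schema-evolution step the question describes (detect columns present in a new CSV header but missing from the table, then ALTER the table) can be kept as a small pure function and only the resulting statement handed to beeline or a Hive driver. A minimal sketch — `alter_statement` and the STRING default type are my assumptions, not a Hive API:

```python
import csv
import io

def alter_statement(table, existing_cols, csv_header, new_col_type="STRING"):
    """Return an ALTER TABLE ... ADD COLUMNS statement for columns that
    appear in the CSV header but not in the table, or None if there are
    no new columns."""
    new_cols = [c for c in csv_header if c not in set(existing_cols)]
    if not new_cols:
        return None
    cols = ", ".join(f"`{c}` {new_col_type}" for c in new_cols)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols});"

# csv2.csv from the example introduces column c
header = next(csv.reader(io.StringIO("a,b,c\n3,8,0\n4,10,2\n")))
stmt = alter_statement("mydbname.testtable", ["a", "b"], header)
# stmt == "ALTER TABLE mydbname.testtable ADD COLUMNS (`c` STRING);"
```

Keeping the statement generation separate from execution makes it easy to log or review every ALTER before it runs, which helps when CSVs arrive continuously.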

On Wed, Dec 26, 2018, 11:55 PM Daniel Takacs <hungarian...@hotmail.com
wrote:

> I'm working on an ETL that requires me to import a continuous stream of
> CSVs into a Hadoop/Hive cluster. For now, let's assume the CSVs need to end
> up in the same database.table. But newer CSVs might introduce
> additional columns (hence I want the script to alter the table and add
> the new columns as it encounters them).
>
> e.g.
>
> csv1.csv:
> a,b
> 1,2
> 2,4
>
> csv2.csv:
> a,b,c
> 3,8,0
> 4,10,2
>
>
> what is the best way to write such an ETL into Hive? Should I use hive
> with -f to run scripts like:
>
>
> upsert.hql:
>
> CREATE TABLE IF NOT EXISTS mydbname.testtable (a INT) ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ',';
>
> SET hive.cli.errors.ignore=true;
>
> ALTER TABLE mydbname.testtable ADD COLUMNS (b string);
>
> SET hive.cli.errors.ignore=false;
>
> LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv' INTO TABLE
> mydbname.testtable;
>
>
>
> (a disadvantage is that when LOAD DATA encounters a string that is not a
> valid integer for an INT column, NULL is inserted and I do not get notified)
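
One way around that silent-NULL behavior is to validate each CSV against the expected column types before handing it to LOAD DATA, and reject or quarantine files with bad rows. A minimal sketch — `SCHEMA` and `bad_rows` are hypothetical names for illustration, not part of Hive:

```python
import csv
import io

# Hypothetical expected types, mirroring the table's Hive columns.
SCHEMA = {"a": int, "b": int}

def bad_rows(csv_text, schema):
    """Return (line_number, column, value) for every cell that fails to
    parse as its declared type; an empty list means the file is clean."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bad = []
    for lineno, row in enumerate(reader, start=2):  # line 1 is the header
        for col, caster in schema.items():
            try:
                caster(row[col])
            except (TypeError, ValueError, KeyError):
                bad.append((lineno, col, row.get(col)))
    return bad

problems = bad_rows("a,b\n1,2\nx,4\n", SCHEMA)
# problems == [(3, "a", "x")] — line 3 has a non-integer in column a
```

Running such a check before LOAD DATA lets the driver script raise an alert instead of discovering NULLs in the table later.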
>
> should I do it from beeline?
>
> should I write a pig script?
>
> should I write a java program?
>
>
> should I use programs like: https://github.com/enahwe/Csv2Hive
>
>
> what's the recommended approach here?
>
>
