Modifying data before importation into Hive

Ken.Barclay Tue, 24 Nov 2009 15:24:52 -0800

Hello,

I'm using Cloudera's hive-0.4.0+14.tar.gz with hadoop-0.20.1+152.tar.gz on a 
CentOS machine.


I've been able to load syslog files into Hive using the RegexSerDe class - this 
works great. But what if your log files are missing a column, or the data needs 
to be manipulated in some way before being put in the table? In our case, we'd 
like to add a YEAR column as it's not included in the log files. We'd like to 
avoid having to rewrite all the logs to put them in that format though.
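
To make the missing column concrete, here is a minimal Python sketch of the per-line rewrite we'd otherwise have to run over every log. It assumes lines in the usual "Mon DD HH:MM:SS host message" syslog shape and a tab-separated output. The names `SYSLOG_RE` and `add_year` are made up for illustration and are not part of Hive.

```python
import re

# Assumed line shape: "Nov 24 15:24:52 myhost sshd[123]: some message"
SYSLOG_RE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$"
)

def add_year(line, year):
    """Return the line's fields with the year prepended (tab-separated),
    or None if the line does not match the assumed syslog shape."""
    m = SYSLOG_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    month, day, time, host, message = m.groups()
    return "\t".join([str(year), month, day, time, host, message])
```

Running something like this over every file is exactly the rewrite step we're hoping to avoid.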

One suggestion from Ashish to a user was to do something like a left outer join 
with data staged in another table and to target the results into a table with 
the desired structure. But the lines of our log file don't have a unique key we 
could use to do such a join - just things like host, day, month, etc.

Is there any other way to add information in conjunction with doing LOAD DATA 
INPATH, given that we can't add data after it's in the table?

Thanks
Ken
