RE: Modifying data before importation into Hive

Ashish Thusoo Tue, 24 Nov 2009 16:13:57 -0800

For this particular case, the LOAD DATA approach is better as that really does 
not do any transformation to the data and yet adds the information that you 
want as a partitioning column, So I would recommend that and not the SELECT 
TRANSFORM approach.


Regarding the SELECT TRANSFORM approach, it would lead to a full scan and 
transform of the data and keep redundant values in the data itself so probably 
an overkill for your particular case.

Ashish

________________________________
From: [email protected] [mailto:[email protected]]
Sent: Tuesday, November 24, 2009 4:08 PM
To: [email protected]
Subject: RE: Modifying data before importation into Hive

Ashish,

We just know we want to add it in because it doesn't exist. It will have the 
static value 2009 for this year's files, and starting next year we'll put in 
the value 2010.

I was thinking I could write a Python script that will simply load the files 
one at a time, using the date as a partition, like in the README:

LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites 
PARTITION (ds='2008-08-15');

Ultimately we want to do Hive queries to search for certain patterns that occur 
in the logs lines between certain dates, so we could include the partition in 
the search to do that.

Would using SELECT TRANSFORM with a python script be another approach?

-Ken

From: Ashish Thusoo [mailto:[email protected]]
Sent: Tuesday, November 24, 2009 3:56 PM
To: [email protected]
Subject: RE: Modifying data before importation into Hive

How are you generating the value in the YEAR column? Is it a static value or 
something that gets computed from the data?

Ashish

________________________________
From: [email protected] [mailto:[email protected]]
Sent: Tuesday, November 24, 2009 3:24 PM
To: [email protected]
Subject: Modifying data before importation into Hive
Hello,

I'm using Cloudera's hive-0.4.0+14.tar.gz with hadoop-0.20.1+152.tar.gz on a 
Centos machine.

I've been able to load syslog files into Hive using the RegexSerDe class - this 
works great. But what if your log files are missing a column, or the data needs 
to be manipulated in some way before being put in the table? In our case, we'd 
like to add a YEAR column as it's not included in the log files. We'd like to 
avoid having to rewrite all the logs to put them in that format though.

One suggestion from Ashish to a user was to do something like a left outer join 
with data staged in another table and to target the results into a table with 
the desired structure. But the lines of our log file don't have a unique key we 
could use to do such a join - just things like host, day, month, etc.

Is there any other way to add information in conjunction with doing LOAD DATA 
INPATH, given that we can't add data after it's in the table?

Thanks
Ken

RE: Modifying data before importation into Hive

Reply via email to