Hi Vijay,

I built a web-based application in Python that allows users to search Solaris 
syslog files by entering a date or date range, the log file type (User, Auth, 
Mail, etc.), and a string or regex to search for. We use syslog-ng to aggregate 
our logs.

To get all our log files into Hive, I wrote a Python program that connects to 
the Hive server, creates a table partitioned by date and log type, and then 
executes a LOAD DATA LOCAL INPATH command on each log file. The full path of 
each file is parsed to extract the date and type so the file lands in the 
right partition. The program takes as an argument either a directory containing 
a whole tree of log files (used to get all our existing logs into the system 
initially) or a list of individual files (run nightly to keep the system up to 
date with the latest logs).
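In case it helps, here's a rough sketch of what the loader does, assuming the 
Thrift Python bindings that ship with Hive (hive_service) and a HiveServer on 
the default port 10000; the path layout and helper names here are just 
illustrative, not our actual ones:

    import re
    import sys

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hive_service import ThriftHive

    def open_client(host='localhost', port=10000):
        # Standard Thrift plumbing for talking to HiveServer
        transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
        client = ThriftHive.Client(TBinaryProtocol.TBinaryProtocol(transport))
        transport.open()
        return client, transport

    def load_log_file(client, path):
        # Pull the date and log type out of the file's path, e.g.
        # /logs/auth/2009-12-01/host1.log -> type='auth', ds='2009-12-01'
        m = re.search(r'/(?P<type>[^/]+)/(?P<ds>\d{4}-\d{2}-\d{2})/[^/]+$', path)
        if not m:
            raise ValueError('unexpected path layout: %s' % path)
        client.execute(
            "LOAD DATA LOCAL INPATH '%s' INTO TABLE SYSLOG "
            "PARTITION (ds='%s', type='%s')"
            % (path, m.group('ds'), m.group('type')))

    if __name__ == '__main__':
        client, transport = open_client()
        for path in sys.argv[1:]:
            load_log_file(client, path)
        transport.close()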

I then wrote a Python CGI script that connects to the Hive server, reads the 
date range and other search parameters from the web application, and uses them 
to generate Hive QL queries that filter on the partition columns. Results are 
returned to the browser or saved to a file on the user's desktop (their 
choice). It works great, and the performance is surprisingly fast.
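As an illustration, the generated query looks something like this (the column 
and table names match the CREATE TABLE below; the parameter values and function 
name are just examples):

    def build_query(start_ds, end_ds, log_type, pattern):
        # Filtering on ds and type lets Hive prune partitions instead of
        # scanning every log file; the search pattern goes to RLIKE for regex
        # matching.
        return ("SELECT month, day, time, host, logline FROM SYSLOG "
                "WHERE ds >= '%s' AND ds <= '%s' AND type = '%s' "
                "AND logline RLIKE '%s'"
                % (start_ds, end_ds, log_type, pattern))

    # 'client' is the same Thrift Hive client as in the loader sketch above
    client.execute(build_query('2009-12-01', '2009-12-07', 'auth', 'Failed password'))
    rows = client.fetchAll()  # each row comes back as a tab-separated string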

One issue I came across was that I could not create a partitioned table that 
uses a RegexSerDe in a single statement; that would just fail with an "Error 
parsing string" error. I got around this by splitting the command into two 
pieces using ALTER TABLE:

   client.execute("CREATE TABLE SYSLOG(month STRING, day STRING, time STRING, 
host STRING, logline STRING) PARTITIONED BY(ds STRING, type STRING) STORED AS 
TEXTFILE")

    client.execute("ALTER TABLE SYSLOG SET SERDE 
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH 
SERDEPROPERTIES('input.regex'='^([^ ]+) {1,2}([^ ]+) ([^ ]+) ([^ ]+) (.+)$')")
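(For what it's worth, the input.regex just splits a standard syslog line into 
its month, day, time, host, and message fields; the " {1,2}" is there because 
syslog pads single-digit days with an extra space, e.g. "Dec  1".)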

The next step is to use mod_python or some other mechanism to improve 
performance further.

Now that the details of using Hive for log searching are all worked out, I'm 
going to work on a system that handles much larger files (our WebLogic 
application logs can be 30 GB/day per server, and we have a lot of servers...)

Of course, we also have to build out a proper cluster; we're just running a 
standalone instance at the moment.

I have to say the hive-user list was indispensable in getting this all to work.

Cheers
Ken
From: Vijay [mailto:[email protected]]
Sent: Monday, December 14, 2009 1:00 PM
To: [email protected]
Subject: ETL workflow experiences with Hive

Can anyone share their ETL workflow experiences with Hive? For example, how do 
you transform data from log files to Hive tables? Do you use Hive with 
map/reduce scripts, or do you use Hive programmatically? Or do you do something 
entirely different? I haven't found any samples or details about the 
programmatic usage of Hive.

Thanks in advance,
Vijay
