Hi Vijay,
I built a web-based application in Python that allows users to search Solaris
syslog files by entering a date or date range, the log file type (User, Auth,
Mail, etc.), and a string or regex to search for. We use syslog-ng to aggregate
our logs.
To get all our log files into Hive, I wrote a Python program that connects to
the Hive server, creates a table partitioned by date and log type, and then
executes a LOAD DATA LOCAL INPATH command on each log file. The full path of
each file is parsed to extract the date and type so the file can be placed in
the right partition. The program can take as an argument either a directory
under which a whole bunch of files exist (used to get all our logs into the
system initially) or a list of individual files (run nightly to keep the
system up to date with the latest logs).
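Roughly, the loader looks like the sketch below. This is a stripped-down
illustration rather than the actual script: it assumes the HiveServer Thrift
Python bindings (the module name varies by Hive version) and a hypothetical
path layout of .../<YYYY-MM-DD>/<logtype>/<file> for extracting the partition
values.

import os
import sys
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hive_service import ThriftHive  # module name differs on some Hive versions

def load_file(client, path):
    # Pull the date and log type out of the file's full path (hypothetical
    # layout: .../<YYYY-MM-DD>/<logtype>/<file>) so the file lands in the
    # right partition.
    parts = os.path.abspath(path).split(os.sep)
    ds, logtype = parts[-3], parts[-2]
    client.execute("LOAD DATA LOCAL INPATH '%s' INTO TABLE SYSLOG "
                   "PARTITION (ds='%s', type='%s')" % (path, ds, logtype))

def main(args):
    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 10000))
    client = ThriftHive.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()
    for arg in args:
        if os.path.isdir(arg):
            # Directory mode: walk the whole tree (initial bulk load).
            for dirpath, _, names in os.walk(arg):
                for name in names:
                    load_file(client, os.path.join(dirpath, name))
        else:
            # File mode: load just the files named on the command line (nightly run).
            load_file(client, arg)
    transport.close()

if __name__ == '__main__':
    main(sys.argv[1:])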
I then wrote a Python CGI script that connects to the Hive server, reads the
date range and other search parameters from the web application, and uses them
to generate HiveQL queries that filter on the partition columns. Results are
returned to the browser or saved to a file on the user's desktop (their
choice). It works great, and the performance is surprisingly fast.
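The query generation itself is straightforward. Here is the rough shape of it
(again a sketch - the form field names, column list, and date format are
illustrative rather than what the real script uses):

import cgi
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hive_service import ThriftHive

form = cgi.FieldStorage()
start   = form.getfirst('start_date')   # e.g. '2009-12-01' (field names are made up)
end     = form.getfirst('end_date')     # e.g. '2009-12-07'
logtype = form.getfirst('log_type')     # e.g. 'auth'
pattern = form.getfirst('pattern')      # plain string or regex

# Filtering on the ds/type partition columns lets Hive read only the
# partitions in range instead of scanning every log file.
query = ("SELECT host, month, day, time, logline FROM SYSLOG "
         "WHERE ds >= '%s' AND ds <= '%s' AND type = '%s' "
         "AND logline RLIKE '%s'" % (start, end, logtype, pattern))

transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 10000))
client = ThriftHive.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()
client.execute(query)

print("Content-Type: text/plain\n")   # CGI header plus the blank line after it
for row in client.fetchAll():         # each row comes back as a tab-separated string
    print(row)
transport.close()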
One issue I came across was that I could not create a partitioned table that
uses a RegexSerDe all in one go - the CREATE TABLE would just fail with an
"Error parsing string" error. I got around this by breaking the command into
two pieces using ALTER TABLE:
client.execute("CREATE TABLE SYSLOG(month STRING, day STRING, time STRING,
host STRING, logline STRING) PARTITIONED BY(ds STRING, type STRING) STORED AS
TEXTFILE")
client.execute("ALTER TABLE SYSLOG SET SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH
SERDEPROPERTIES('input.regex'='^([^ ]+) {1,2}([^ ]+) ([^ ]+) ([^ ]+) (.+)$')")
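For reference, here is what that input.regex pulls out of a typical line. The
sample line below is made up, and the RegexSerDe uses Java regexes rather than
Python ones, but this pattern behaves the same way in both; the five groups
map to the month, day, time, host, and logline columns:

import re

pattern = r'^([^ ]+) {1,2}([^ ]+) ([^ ]+) ([^ ]+) (.+)$'
line = 'Dec 14 13:00:00 myhost sshd[1234]: Accepted publickey for ken'
print(re.match(pattern, line).groups())
# ('Dec', '14', '13:00:00', 'myhost', 'sshd[1234]: Accepted publickey for ken')

The {1,2} after the first group allows for the extra space syslog inserts
before single-digit days.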
The next step is to use mod_python or some other mechanism to improve
performance further.
Now that the details of using Hive for log searching are all worked out, I'm
going to work on a system that handles much larger files (our WebLogic
application logs can be 30 GB/day per server, and we have a lot of servers...).
Of course, we also have to build out a proper cluster - we're just running a
standalone instance at the moment.
I have to say the hive-user list was indispensable in getting this all to work.
Cheers
Ken
From: Vijay [mailto:[email protected]]
Sent: Monday, December 14, 2009 1:00 PM
To: [email protected]
Subject: ETL workflow experiences with Hive
Can anyone share their ETL workflow experiences with Hive? For example, how do
you transform data from log files into Hive tables? Do you use Hive with
map/reduce scripts, or do you use Hive programmatically? Or do you do something
entirely different? I haven't found any samples or details about the
programmatic usage of Hive.
Thanks in advance,
Vijay