Hi Jeff,

That is actually exactly what we're going to do. We rebuilt our cluster with 
CDH3b2 as soon as it came out, got Hue and Flume working, and they're great!
The problem below had to be solved very quickly; it involved bulk-loading a 
large volume of production logs so we could analyze them.

The plan for next month is to start building the real-time feed of logs into a 
separate cluster, which will take time because of the need to open ssh tunnels 
between the Hadoop bubble and the other environments.

Thanks
Ken

From: Jeff Hammerbacher [mailto:ham...@cloudera.com]
Sent: Friday, September 24, 2010 3:37 PM
To: hive-user@hadoop.apache.org
Subject: Re: Need a faster way to load files into Hive!

Hey Ken,

You'll probably want to use Flume to collect your log files into HDFS/Hive. See 
https://issues.cloudera.org/browse/FLUME-74 for some of the integration work 
happening at Mozilla. We'd love your help on requirements! Any comments on the 
ticket will be helpful.

Regards,
Jeff
On Fri, Sep 24, 2010 at 3:06 PM, 
<ken.barc...@wellsfargo.com> wrote:
Hi,

We currently have Python code that loads log data into Hive. The steps it 
follows are:


 1.  Copy the log files (.gz files) to a staging directory in HDFS. We're mounting 
HDFS using fuse-dfs (as per http://wiki.apache.org/hadoop/MountableHDFS); a rough 
sketch of this copy step follows the list.
 2.  Create an external table that points to that staging directory. (The log 
files have been rewritten so that their columns correspond to the schema of the 
table.)
 3.  Select each file's data, one file at a time, from the external table and do 
an INSERT OVERWRITE TABLE into the final table, which is stored as 
SEQUENCEFILE. The file's date and name are stored in partitions FILEDATE and 
FILENAME in the final table. We then drop the external table.
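
Roughly, the copy in step 1 looks like the sketch below; the local log directory 
and the fuse-dfs mount point are placeholders for our real paths.

import glob
import shutil

# fuse-dfs mount of the HDFS staging directory (placeholder path)
STAGING = '/mnt/hdfs/user/data/staging/'

# copy each rewritten .gz log file into the staging directory
for path in glob.glob('/data/logs/rewritten/*.gz'):
    shutil.copy(path, STAGING)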

We use this two-table approach because it appears to be the only way to get 
the data stored as SEQUENCEFILE.

For example, to load one file, we do:

CREATE EXTERNAL TABLE EXT_TEST_STG(datestamp STRING, time_stamp STRING, seq BIGINT,
  server STRING, filename STRING, logline STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011'
STORED AS TEXTFILE
LOCATION '/user/data/staging/'

CREATE TABLE IF NOT EXISTS newtable(datestamp STRING, time_stamp STRING, seq BIGINT,
  server STRING, filename STRING, logline STRING)
PARTITIONED BY(filedate STRING, filename_p STRING)
STORED AS SEQUENCEFILE

FROM EXT_TEST_STG ets
INSERT OVERWRITE TABLE newtable
  PARTITION (filedate='2010-09-24', filename_p='trustAdmin.log.truststruts2.2010-09-17.gz')
SELECT ets.datestamp, ets.time_stamp, ets.seq, ets.server, ets.filename, ets.logline
WHERE ets.filename='trustAdmin.log.truststruts2.2010-09-17.gz'
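
For context, the per-file loop amounts to something like the sketch below. It 
shells out to the hive CLI with -e purely for illustration, and the staged_files 
list is a placeholder for the (date, name) pairs we actually build from the 
staging directory.

import subprocess

INSERT_TEMPLATE = (
    "FROM EXT_TEST_STG ets "
    "INSERT OVERWRITE TABLE newtable "
    "PARTITION (filedate='%(filedate)s', filename_p='%(fname)s') "
    "SELECT ets.datestamp, ets.time_stamp, ets.seq, ets.server, "
    "ets.filename, ets.logline "
    "WHERE ets.filename='%(fname)s'"
)

# placeholder: in the real code these pairs come from the staged files
staged_files = [
    ('2010-09-24', 'trustAdmin.log.truststruts2.2010-09-17.gz'),
]

# one Hive job per staged file; each job re-reads the whole staging directory
for filedate, fname in staged_files:
    query = INSERT_TEMPLATE % {'filedate': filedate, 'fname': fname}
    subprocess.check_call(['hive', '-e', query])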

The problem is that when there are a LOT of files in the staging directory, each 
INSERT OVERWRITE statement takes a *long* time to execute, presumably because 
Hive has to read through all of those gzipped files to locate the data for the 
one file indicated: the staging table isn't partitioned, so the WHERE clause on 
filename can't prune anything and every per-file insert scans the whole directory.

We thought of copying only one file at a time to the staging directory and 
loading it before copying the next, but with fuse-dfs it takes some time for the 
file to become readable once the copy is done. We could write code that polls 
the file and starts the load only when it's ready, but that seems clunky.
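
For what it's worth, the polling we had in mind would be something like the 
sketch below; the size-stability check is just one guess at a workable "ready" 
test.

import os
import time

def wait_until_ready(path, timeout=60, interval=1):
    # Poll a file on the fuse-dfs mount until it is visible and its size stops changing.
    deadline = time.time() + timeout
    last_size = -1
    while time.time() < deadline:
        try:
            size = os.path.getsize(path)
            if size > 0 and size == last_size:
                return True
            last_size = size
        except OSError:
            pass  # not visible through fuse-dfs yet
        time.sleep(interval)
    return False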

Question: Is there an easier way to do this? Needless to say, we have thousands 
and thousands of log files, so we're trying to optimize the speed of the 
loading process.

Thanks!
Ken

