Hi Adam,

Thanks for the suggestion. We did want to keep both the filename and filedate 
partitions, though, so that later queries stay efficient.

I did, however, work out a simple fix to the problem that eliminated all the 
extra latency.

Instead of staging all the files to be loaded into *one* directory and loading 
them from there, I now use two directories.

1) /user/data/staging - for all the files to be loaded
2) /user/data/load - temporary holding dir, for just one file at a time.

At loading time I copy one file at a time to /user/data/load, load it from 
there, then delete it and copy the next file, and so on. Loading in this way is 
now almost instantaneous for small files, and we're getting extremely good 
throughput for huge production logs.
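
In case the details are useful, the loop in our Python automation looks 
roughly like the sketch below. It's simplified, not our exact production 
code: it assumes the EXT_TEST_STG external table's LOCATION has been changed 
to point at /user/data/load, and the filedate() helper for deriving the 
partition date from a file name is only illustrative.

import subprocess

STAGING = '/user/data/staging'   # all files waiting to be loaded
LOAD = '/user/data/load'         # holds exactly one file at a time
# Assumes EXT_TEST_STG's LOCATION is /user/data/load, so the external
# table only ever sees the single file currently being loaded.

def run(args):
    # Run a command and raise if it exits non-zero.
    subprocess.check_call(args)

def hdfs_files(path):
    # List plain files in an HDFS directory by parsing 'hadoop dfs -ls'.
    out = subprocess.check_output(['hadoop', 'dfs', '-ls', path],
                                  universal_newlines=True)
    return [line.split()[-1].rsplit('/', 1)[-1]
            for line in out.splitlines() if line.startswith('-')]

def filedate(name):
    # Illustrative only: pulls '2010-09-17' out of a name like
    # 'trustAdmin.log.truststruts2.2010-09-17.gz'.
    return name.rsplit('.', 2)[-2]

for name in hdfs_files(STAGING):
    # Copy exactly one file into the directory the external table reads from.
    run(['hadoop', 'dfs', '-cp', '%s/%s' % (STAGING, name), LOAD + '/'])
    # Load it; with only one file present there is nothing extra to scan.
    hql = ("FROM EXT_TEST_STG ets "
           "INSERT OVERWRITE TABLE newtable "
           "PARTITION (filedate='%s', filename_p='%s') "
           "SELECT ets.datestamp, ets.time_stamp, ets.seq, ets.server, "
           "ets.filename, ets.logline") % (filedate(name), name)
    run(['hive', '-e', hql])
    # Remove it so the next iteration again sees exactly one file.
    run(['hadoop', 'dfs', '-rm', '%s/%s' % (LOAD, name)])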

Importantly, I also do not use fuse in my automation programs except to check 
for the existence of files or directories and for other Unix admin tasks. 
Getting files from the local file system into HDFS should *only* be done with 
hadoop dfs -copyFromLocal, because it takes time for HDFS to replicate a file 
transferred into it with fuse.
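
In the scripts that just means the local-to-HDFS copy is a shell-out to the 
fs shell rather than a write through the fuse mount; something like this 
(the local path here is made up):

import subprocess

# Push a local gzipped log straight into the HDFS staging directory.
# (The local path is illustrative only.)
subprocess.check_call(['hadoop', 'dfs', '-copyFromLocal',
                       '/var/log/app/trustAdmin.log.truststruts2.2010-09-17.gz',
                       '/user/data/staging/'])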

Cheers!
Ken


-----Original Message-----
From: Adam O'Donnell [mailto:a...@immunet.com] 
Sent: Friday, September 24, 2010 4:05 PM
To: hive-user@hadoop.apache.org
Subject: Re: Need a faster way to load files into Hive!

Ken:

Why not merge all the files into one partition, namely the date
partition?  Put all the logs you want to read into the directory, then
do your insert, but ignore the file name past that point.

On Fri, Sep 24, 2010 at 3:06 PM,  <ken.barc...@wellsfargo.com> wrote:
> Hi,
>
> We have Python code currently to load log data into Hive. The steps it
> follows are:
>
>
> 1) Copy log files (.gz files) to a staging directory in HDFS. We're mounting
>    HDFS using fuse-dfs (as per http://wiki.apache.org/hadoop/MountableHDFS).
> 2) Create an external table that points to that staging directory. (The log
>    files have been rewritten to have columns corresponding to the schema of
>    the table.)
> 3) Select each file's data, one file at a time, from the external table and
>    do an INSERT OVERWRITE TABLE into the final table, which is stored as
>    SEQUENCEFILE. The file's date and name are stored in partitions FILEDATE
>    and FILENAME in the final table. We then drop the external table.
>
>
> We use this two-table approach because that appears to be the only way to
> get your data stored as SEQUENCEFILE.
>
> For example to load one file, we do:
>
> CREATE EXTERNAL TABLE EXT_TEST_STG(
>     datestamp STRING, time_stamp STRING, seq BIGINT,
>     server STRING, filename STRING, logline STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011'
> STORED AS TEXTFILE
> LOCATION '/user/data/staging/'
>
> CREATE TABLE IF NOT EXISTS newtable(
>     datestamp STRING, time_stamp STRING, seq BIGINT,
>     server STRING, filename STRING, logline STRING)
> PARTITIONED BY (filedate STRING, filename_p STRING)
> STORED AS SEQUENCEFILE
>
> FROM EXT_TEST_STG ets
> INSERT OVERWRITE TABLE newtable
>     PARTITION (filedate='2010-09-24',
>                filename_p='trustAdmin.log.truststruts2.2010-09-17.gz')
> SELECT ets.datestamp, ets.time_stamp, ets.seq, ets.server, ets.filename,
>        ets.logline
> WHERE ets.filename='trustAdmin.log.truststruts2.2010-09-17.gz'
>
> The problem is, when there are a LOT of files in the staging directory, the
> INSERT OVERWRITE statement takes a *long* time to execute, presumably
> because Hive is actually reading through all those gzipped files to locate
> the data for the file indicated.
>
> We thought of copying only one file at a time to the staging directory and
> loading one file at a time, but using fuse-dfs, it takes some time for the
> file to be available for reading once the copy is done. We could write code
> to poll the file to start the load only when it's ready, but that seems
> clunky.
>
> Question: Is there an easier way to do this? Needless to say we have
> thousands and thousands of log files, so we're trying to optimize the speed
> of the loading process.
>
> Thanks!
> Ken
>



-- 
Adam J. O'Donnell, Ph.D.
Immunet Corporation
Cell: +1 (267) 251-0070
