Hi all,

Has anyone been having issues with Hadoop jobs that take a large collection of gzipped files as input, specifically on EC2? I currently have a job that takes as input about 360 gzipped log files in HDFS, totaling about 30 GB of compressed data. I've noticed that if I leave these files compressed, the JobClient hangs at 0% map, 0% reduce, and eventually (after a rather long time) jumps straight to 100% map and reduce without having reported any progress in between.
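One thing I'm planning to try is calling Reporter.progress() explicitly from my map method, in case the framework just isn't registering liveness on these long, unsplittable gzip streams. A rough sketch of what I mean, against the 0.17-era mapred API (MyMapper, the pass-through logic, and REPORT_INTERVAL are illustrative, not my actual code):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  private static final long REPORT_INTERVAL = 10000; // records between pings
  private long records = 0;

  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, Text> output,
                  Reporter reporter) throws IOException {
    // Identity-style pass-through over log lines.
    output.collect(key, value);
    if (++records % REPORT_INTERVAL == 0) {
      reporter.setStatus("processed " + records + " records");
      reporter.progress(); // keep the task from looking hung
    }
  }
}

I don't know yet whether this helps, since each gzip file has to be consumed by a single mapper anyway, but it should at least rule out a pure reporting problem.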
However, if I unzip the files before pushing them to HDFS, the job starts almost immediately. My current workaround is a Cascading script that unzips the log files from a local directory while pushing them into HDFS. That approach is rather brute force, though, and takes an incredibly long time.

Has anyone else seen this kind of behavior? I'm running the job on EC2 using 10 large instances and version 0.17.0, but I've noticed the same issue locally as well. Is there, perhaps, a flag I'm not setting in my JobConf, or an idiosyncrasy in the way gzipped files are read from HDFS, that would account for this? I haven't noticed anything unusual in my code or in the logs that would point to a problem.

I know this question is rather vague, but any help/input would be greatly appreciated.

Thanks,
Mike
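P.S. In case it helps anyone reproduce the workaround, this is roughly what my unzip-and-push step boils down to, written against the plain FileSystem API rather than my actual Cascading script (the class name and paths are illustrative):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class GunzipToHdfs {
  public static void main(String[] args) throws IOException {
    File localDir = new File(args[0]); // local directory of *.gz logs
    Path hdfsDir = new Path(args[1]);  // HDFS target directory

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    for (File gz : localDir.listFiles()) {
      if (!gz.getName().endsWith(".gz")) continue;
      String base = gz.getName().substring(0, gz.getName().length() - 3);
      InputStream in = new GZIPInputStream(new FileInputStream(gz));
      OutputStream out = fs.create(new Path(hdfsDir, base));
      // Decompress on the fly so only uncompressed data lands in HDFS.
      IOUtils.copyBytes(in, out, conf, true); // closes both streams
    }
  }
}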
