Hi all,

Has anyone been having issues with Hadoop jobs that take a large collection of gzipped files as input, specifically on EC2? I currently have a job that takes as input about 360 gzipped log files in HDFS, totaling about 30 GB of compressed data. I've noticed that if I leave these files compressed, the JobClient hangs at 0% map, 0% reduce, and eventually (after a rather long time) jumps straight to 100% map and reduce without having reported any progress in between.
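One thing I'm planning to try is calling Reporter.progress() explicitly from my map method, in case the framework just isn't registering liveness on these long, unsplittable gzip streams. A rough sketch of what I mean, against the 0.17-era mapred API (MyMapper, the pass-through logic, and REPORT_INTERVAL are illustrative, not my actual code):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  private static final long REPORT_INTERVAL = 10000; // records between pings
  private long records = 0;

  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, Text> output,
                  Reporter reporter) throws IOException {
    // Identity-style pass-through over log lines.
    output.collect(key, value);
    if (++records % REPORT_INTERVAL == 0) {
      reporter.setStatus("processed " + records + " records");
      reporter.progress(); // keep the task from looking hung
    }
  }
}

I don't know yet whether this helps, since each gzip file has to be consumed by a single mapper anyway, but it should at least rule out a pure reporting problem.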
However, if I unzip the files before pushing them to HDFS, the job starts almost immediately. My current workaround is a Cascading script that unzips the log files from a local directory while pushing them into HDFS. That approach is rather brute force, though, and takes an incredibly long time.

Has anyone else seen this kind of behavior? I'm running the job on EC2 using 10 large instances and version 0.17.0, but I've noticed the same issue locally as well. Is there, perhaps, a flag I'm not setting in my JobConf, or an idiosyncrasy in the way gzipped files are read from HDFS, that would account for this? I haven't noticed anything unusual in my code or in the logs that would point to a problem.

I know this question is rather vague, but any help/input would be greatly appreciated.

Thanks,
Mike
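P.S. In case it helps anyone reproduce the workaround, this is roughly what my unzip-and-push step boils down to, written against the plain FileSystem API rather than my actual Cascading script (the class name and paths are illustrative):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class GunzipToHdfs {
  public static void main(String[] args) throws IOException {
    File localDir = new File(args[0]); // local directory of *.gz logs
    Path hdfsDir = new Path(args[1]);  // HDFS target directory

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    for (File gz : localDir.listFiles()) {
      if (!gz.getName().endsWith(".gz")) continue;
      String base = gz.getName().substring(0, gz.getName().length() - 3);
      InputStream in = new GZIPInputStream(new FileInputStream(gz));
      OutputStream out = fs.create(new Path(hdfsDir, base));
      // Decompress on the fly so only uncompressed data lands in HDFS.
      IOUtils.copyBytes(in, out, conf, true); // closes both streams
    }
  }
}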
