Quoted here is a solution from Amogh:

"Try manipulating the value mapred.max.map.failures.percent to a % of 
files you expect to be corrupted / acceptable data skip percent"
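
For reference, that property can be set per job; with the old mapred
Java API there is a dedicated setter (the 5% threshold below is only an
illustration, not a recommendation):

import org.apache.hadoop.mapred.JobConf;

public class FailurePercentExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Let up to 5% of map tasks fail without failing the whole job;
        // this writes the mapred.max.map.failures.percent property.
        conf.setMaxMapTaskFailuresPercent(5);
    }
}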

But I'd also like to write a tool that detects corrupted .gz files in a
Hadoop HDFS cluster so they can be excluded from map tasks. A
time-consuming way to do this might be to call zlib functions to
decompress each .gz file and, wherever an IOException is caught, list
that file as corrupted. Is there a better way to do this? For example,
is there a checksum (e.g. Adler-32 or CRC-32) written inside a .gz
file, so that we can compare the checksum of the decompressed data with
the one stored in the file?
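
For what it's worth: per RFC 1952, each gzip member ends with a CRC-32
of the uncompressed data and its length modulo 2^32 (Adler-32 belongs
to the zlib wrapper, not gzip). java.util.zip.GZIPInputStream verifies
both trailer fields when the stream is read to EOF, so draining a file
through it already performs that checksum comparison; there is no way
to check the payload CRC without decompressing. A minimal sketch (the
class and method names are mine, not an existing API):

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipCheck {
    // Returns true if the .gz file at the given HDFS path decompresses
    // cleanly. GZIPInputStream validates the trailer CRC-32 and length
    // on EOF and throws an IOException on mismatch or truncation.
    public static boolean isValidGzip(FileSystem fs, Path p) {
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = new GZIPInputStream(fs.open(p))) {
            while (in.read(buf) != -1) {
                // Discard the data; only the trailer check matters.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}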

Thanks,

Michael

--- On Sun, 2/21/10, jiang licht <[email protected]> wrote:

From: jiang licht <[email protected]>
Subject: Re: Unexpected empty result due to corrupted gz file input to Map?
To: [email protected]
Date: Sunday, February 21, 2010, 10:17 PM

Thanks, Ashutosh.

I took a quick look at the source code along the following trace and
found that the IOException is not handled, which, it seems to me, will
fail the whole map job, just as you pointed out, even though many other
map tasks would generate non-empty results.

However, I'm also considering the following settings (not tested yet),
which let a map task skip bad records instead of failing outright. Do
you think they will help?

mapred.skip.attempts.to.start.skipping
mapred.skip.map.max.skip.records
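
If it helps, those two properties map onto the SkipBadRecords helper in
the old mapred API; a minimal sketch of turning skip mode on (the
values are illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipModeExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Enter skip mode after 2 failed attempts of the same task
        // (mapred.skip.attempts.to.start.skipping).
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        // Tolerate up to 1000 records skipped around a bad record
        // (mapred.skip.map.max.skip.records).
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1000L);
    }
}

That said, it's not obvious record skipping can recover from a
truncated gzip stream, since the failure is in the decompression stream
itself rather than in any particular record.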

Error message:
java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at org.apache.pig.impl.io.BufferedPositionedInputStream.read(BufferedPositionedInputStream.java:52)
    at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:125)
    at org.apache.pig.backend.executionengine.PigSlice.next(PigSlice.java:126)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper$1.next(SliceWrapper.java:163)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper$1.next(SliceWrapper.java:139)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)


Thanks,

Michael

--- On Sun, 2/21/10, Ashutosh Chauhan <[email protected]> wrote:

From: Ashutosh Chauhan <[email protected]>
Subject: Re: Unexpected empty result due to corrupted gz file input to Map?
To: [email protected]
Date: Sunday, February 21, 2010, 8:40 PM

Hi Michael,

gz'ed files cannot be split across maps, so a whole gzip file will be
processed by one mapper. Now, if a gzip file is corrupted, that map
task will keep failing and eventually Hadoop will declare the whole job
failed. So even if you have just one corrupted gzip file, Hadoop (and
thus Pig) won't ignore it: the whole job will fail, and as a result
your other gz files won't be processed either.

In a nutshell, if there is a possibility of corrupted gzip files in
your data, you need to write and run a script to weed out the corrupted
files before launching the Pig script.
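
As one possible shape for such a weed-out pass, here is a minimal
sketch that walks an HDFS directory and moves any .gz file that fails
decompression into a quarantine directory (the paths and class name are
illustrative, not an existing tool):

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WeedOutCorruptGz {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path inputDir = new Path(args[0]);    // e.g. /data/incoming
        Path quarantine = new Path(args[1]);  // e.g. /data/corrupt
        fs.mkdirs(quarantine);
        for (FileStatus st : fs.listStatus(inputDir)) {
            Path p = st.getPath();
            if (!p.getName().endsWith(".gz")) {
                continue;
            }
            if (!decompressesCleanly(fs, p)) {
                // Move the bad file aside so the Pig load won't see it.
                fs.rename(p, new Path(quarantine, p.getName()));
                System.err.println("quarantined: " + p);
            }
        }
    }

    // Reading to EOF forces GZIPInputStream to verify the gzip
    // trailer's CRC-32; any corruption surfaces as an IOException.
    private static boolean decompressesCleanly(FileSystem fs, Path p) {
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = new GZIPInputStream(fs.open(p))) {
            while (in.read(buf) != -1) {
                // Drain.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}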

Hope it helps,
Ashutosh

On Sat, Feb 20, 2010 at 22:47, jiang licht <[email protected]> wrote:

> I had a Pig script which reads a folder of ".gz" files and performs
> some operations on the data.
>
> However, here's a problem. The folder contains some corrupted gz
> files, and this causes the Hadoop job to generate an empty result in
> the end; that is, all part-#### files are zero bytes long, even
> though a non-empty result is expected (verified by running against at
> least one good .gz file).
>
> As it turns out, a corrupted .gz input to a map task causes Hadoop to
> throw the following exception:
>
> "java.io.EOFException: Unexpected end of ZLIB input stream"
>
> My guess is that such corrupted files will not be loaded (since the
> above exception is thrown), but data from the good .gz files still
> gets loaded. Then why is an empty result (0-sized part-####)
> generated? Given this situation of loading a mix of good and
> corrupted ".gz" files, how can I still get the expected results?
>
> One way might be to write a map/reduce job to detect each corrupted
> .gz file and exclude it from loading into Pig. So, what is the
> easiest way to test the integrity of a gz file in Java, and which
> package should I use? But I am more interested in knowing whether
> there is a Pig solution, since I'd guess Pig could ignore such files
> (though it seems to get caught in trouble instead). Any thoughts?
>
> Thanks!
>
> Michael