I have a Pig script that reads a folder of ".gz" files and performs some
operations on the data.

However, there is a problem: the folder contains some corrupted .gz files, and
this causes the Hadoop job to generate an empty result in the end, that is, all
part-#### files are zero bytes long. A non-empty result should be expected; I
verified this by running the script against a single good .gz file.

As it turns out, a corrupted .gz input to the map phase causes Hadoop to throw
the following exception:

    java.io.EOFException: Unexpected end of ZLIB input stream

My guess is that such corrupted files will not be loaded (since the above
exception is thrown), but the data from the good .gz files should still be
loaded. Why, then, is an empty result generated (zero-sized part-#### files)?
Given this situation of loading a mix of good and corrupted ".gz" files, how
can I still get the expected results?

One way might be to write a map/reduce job that detects each corrupted .gz
file and excludes it from being loaded into Pig. So, what is the easiest way
to test the integrity of a .gz file in Java, and which package should I use?
But I am more interested in knowing whether there is a Pig solution, since I
would expect Pig to be able to ignore such files (yet it seems to get into
trouble instead). Any thoughts?
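In the meantime, the workaround I am considering is a small pre-filter driver
that runs the same check against HDFS and moves bad files out of the input
folder before the Pig script starts. The class name, argument layout, and
quarantine directory below are made up for illustration:

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Moves corrupted .gz files out of an input folder before Pig loads it. */
public class GzQuarantine {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path input = new Path(args[0]);       // the folder the Pig script LOADs
        Path quarantine = new Path(args[1]);  // destination for the bad files
        fs.mkdirs(quarantine);
        for (FileStatus stat : fs.listStatus(input)) {
            Path p = stat.getPath();
            if (!p.getName().endsWith(".gz")) continue;
            if (!decompressesCleanly(fs, p)) {
                fs.rename(p, new Path(quarantine, p.getName()));
            }
        }
    }

    // same check as above, reading from HDFS instead of the local disk
    private static boolean decompressesCleanly(FileSystem fs, Path p) {
        InputStream in = null;
        try {
            in = new GZIPInputStream(fs.open(p));
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) { /* drain */ }
            return true;
        } catch (IOException e) {
            return false;
        } finally {
            if (in != null) {
                try { in.close(); } catch (IOException ignored) {}
            }
        }
    }
}

It is a crude extra pass over the data, though, which is why a Pig-level
option would be much nicer.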

Thanks!

Michael


      
