I have a Pig script that reads a folder of ".gz" files and performs some
operations on the data.
However, there's a problem. The folder contains some corrupted .gz files, and
this causes the Hadoop job to generate an empty result in the end, that is, all
part-#### files are zero bytes long. A non-empty result is expected, though
(I verified this by running against at least one good .gz file).
As it turns out, a corrupted .gz input causes the map tasks to throw the
following exception:

java.io.EOFException: Unexpected end of ZLIB input stream
My guess is that such corrupted files are simply not loaded (since the above
exception is thrown), but data from the good .gz files should still get loaded.
Why, then, is the result empty (zero-sized part-#### files)? In other words,
when loading a mix of good and corrupted ".gz" files, how can I still get the
expected results?
One way might be to write a map/reduce job to detect each corrupted .gz file
and exclude it from being loaded into Pig. So, what is the easiest way to test
the integrity of a .gz file in Java, and which package should I use? I sketched
one approach below.
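
Here is a minimal sketch of what I have in mind, using java.util.zip.GZIPInputStream
from the standard library: decompress the whole stream and treat any IOException as
corruption. (For files already on HDFS, I assume the stream would be opened via
Hadoop's FileSystem.open() instead of a local FileInputStream; the class name below
is just for illustration.)

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;

    public class GzipIntegrityCheck {

        // Returns true if the file decompresses cleanly end-to-end.
        // A truncated or corrupted .gz surfaces as an IOException
        // (typically "java.io.EOFException: Unexpected end of ZLIB input stream").
        public static boolean isValidGzip(String path) {
            byte[] buf = new byte[8192];
            try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(path))) {
                while (in.read(buf) != -1) {
                    // Discard the decompressed bytes; we only care that reading succeeds.
                }
                return true;
            } catch (IOException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            for (String path : args) {
                System.out.println(path + " -> " + (isValidGzip(path) ? "OK" : "CORRUPT"));
            }
        }
    }
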
But I am more interested in whether there is a Pig-level solution, since I would
have expected Pig to be able to simply skip such files (yet it seems to get
tripped up by them instead). Any thoughts?
Thanks!
Michael