I have narrowed down the problem further. The ParquetReader class uses
ParquetFileReader, which is the class that calls
FSDataInputStream.read(ByteBuffer) to read all the data.

However, the ParquetFileReader implementation found in the parquet-mr git
repo (the one used by the HEAD command below) uses
FSDataInputStream.readFully(ByteBuffer) instead:

https://github.com/Parquet/parquet-mr/blob/master/parquet-tools/src/main/java/parquet/tools/command/HeadCommand.java
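
For reference, the contract of read(ByteBuffer) only guarantees that *some*
bytes are read per call, so a readFully has to loop until the buffer is
full. A minimal sketch of that loop (the readFullyCompat helper name is
mine, not from parquet-mr):

    import java.io.EOFException;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    import org.apache.hadoop.fs.FSDataInputStream;

    public class ReadFullyCompat {

      // Keep reading until the buffer is full. A single read(ByteBuffer)
      // against DFSInputStream may legally return after one ~64k packet,
      // which matches what I am seeing below.
      public static void readFullyCompat(FSDataInputStream in, ByteBuffer buf)
          throws IOException {
        while (buf.hasRemaining()) {
          int n = in.read(buf); // may return fewer bytes than remaining
          if (n < 0) {
            throw new EOFException("Stream ended with " + buf.remaining()
                + " bytes still expected");
          }
        }
      }
    }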

This explains why I can't use the ParquetReader found in the Drill jars.
Drill itself, of course, does not use this class.

I also use the ParquetWriter found in the Drill jars to create parquet
files. This seems to work: Drill is able to query these files. However,
since the ParquetReader did not work, I'm not as confident in using the
ParquetWriter from Drill.
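
One cheap sanity check I could run on the files the Drill ParquetWriter
produces is to parse the footer with parquet-mr's
ParquetFileReader.readFooter. A sketch (the hasReadableFooter helper is
mine; a parseable footer does not prove the page data is intact, but a
truncated or corrupt file usually fails right here):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class FooterCheck {

      // Try to parse the file footer; corrupt or truncated files
      // typically throw an exception here.
      public static boolean hasReadableFooter(Configuration conf, Path file) {
        try {
          ParquetMetadata meta = ParquetFileReader.readFooter(conf, file);
          return meta.getBlocks() != null;
        } catch (Exception e) {
          return false;
        }
      }
    }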

What would you suggest? Is it safe to use the ParquetWriter found in Drill?
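
In case it helps frame the question, this is roughly the shape of the
detection tool I have in mind, built on the footer check above (a sketch;
the directory argument and the .parquet suffix convention are my
assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ScanForCorruptFiles {

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Report every .parquet file whose footer fails to parse so it
        // can be moved out of the pool of files Drill queries.
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
          Path file = status.getPath();
          if (!file.getName().endsWith(".parquet")) {
            continue;
          }
          if (!FooterCheck.hasReadableFooter(conf, file)) {
            System.out.println("possibly corrupt: " + file);
          }
        }
      }
    }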

On Wed, Mar 23, 2016 at 7:21 AM, Jean-Claude Cote <[email protected]> wrote:

> Whenever Drill encounters a corrupted parquet file, it stops processing
> the query.
>
> To work around this issue I'm trying to write a simple tool to detect
> corrupted parquet files so that we can remove them from the pool of files
> Drill will query.
>
> I'm basically doing a HEAD command like the one in the parquet-tools
> project:
>
> https://github.com/Parquet/parquet-mr/blob/master/parquet-tools/src/main/java/parquet/tools/command/HeadCommand.java
>
> PrintWriter writer = new PrintWriter(Main.out, true);
> reader = new ParquetReader<SimpleRecord>(new Path(input),
>     new SimpleReadSupport());
> for (SimpleRecord value = reader.read(); value != null && num-- > 0;
>      value = reader.read()) {
>   value.prettyPrint(writer);
>   writer.println();
> }
>
> However, when I run this on a valid parquet file in HDFS it fails. It
> works fine if the file is on local disk.
>
> I'm getting this error: can not read class
> org.apache.parquet.format.PageHeader: Required field
> 'uncompressed_page_size' was not found in the serialized data!
>
> I've narrowed down the issue to DFSInputStream.read(ByteBuffer). This
> method gets called to read the entire file into the ByteBuffer. It works
> fine when the file is local, because in that case
> FSInputStream.read(ByteBuffer) is used, but it fails when the file is in
> HDFS.
>
> Instead of reading the entire file, it reads only 64k; the rest of the
> ByteBuffer is all zeros. I've read that 64k is the default chunk size
> used by the DFSClient, which seems related. Any suggestions or ideas why
> the method does not read all the bytes requested?
>
> Thanks
> Jean-Claude
