Re: parquet data corruption

Cheng Lian Thu, 21 Apr 2016 21:34:50 -0700

(cc [email protected])

Hey Shushant,

This kind of error can be tricky to debug. Could you please provide the following information:

- The tool used to write those Parquet files (possibly Hive 0.13 since you mentioned hive-exec 0.13?)

- The tool used to read those Parquet files (should be Hive according to the stack trace, but what version?)

- What is the "complex" query?

- Schema of those Parquet files (can be checked using parquet-tools), as well as the corresponding schema of the user application (table schema for Hive)

- If possible, the code snippet you used to write the files

- Are there files of different schemata mixed up? Some tools, like Hive, don't handle schema evolution well.
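
The last check in the list above (files of different schemata mixed up) can be sketched roughly as follows. This is only an illustration and not part of the original thread: it assumes you have already collected each file's schema as text (for instance from parquet-tools), and the function name, file names, and schema strings are all hypothetical.

```python
from collections import defaultdict


def group_files_by_schema(schemas):
    """Group file paths by their schema text.

    `schemas` maps a file path to that file's schema as a string.
    Returns a dict mapping each distinct (whitespace-normalized) schema
    to the sorted list of files that use it.
    """
    groups = defaultdict(list)
    for path, schema_text in schemas.items():
        # Collapse whitespace so formatting differences are not
        # mistaken for real schema differences.
        normalized = " ".join(schema_text.split())
        groups[normalized].append(path)
    return {schema: sorted(paths) for schema, paths in groups.items()}


if __name__ == "__main__":
    # Hypothetical input: two files agree, one has a different column type.
    collected = {
        "a.parquet": "message m { optional binary sessionid; }",
        "b.parquet": "message m {\n  optional binary sessionid;\n}",
        "c.parquet": "message m { optional int64 sessionid; }",
    }
    for schema, paths in group_files_by_schema(collected).items():
        print(len(paths), "file(s):", paths)
        print("   ", schema)
```

More than one group in the result would suggest that files with different schemata share the directory, which is worth ruling out before looking for actual corruption.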

I saw that the file name in the stack trace consists of a timestamp. This isn't the naming convention used by Hive. Did you move files written somewhere else to the target directory?


Cheng

On 4/22/16 10:56 AM, Shushant Arora wrote:

Hi
I am writing to a Parquet table using parquet.hadoop.ParquetOutputFormat (from hive-exec 0.13). Data is being written correctly, and when I do count(1) or select * with limit I get the proper result.
But when I run a more complex query on the table, it throws the exception below:

Diagnostic Messages for this Task:
Error: java.io.IOException: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 18 in block 0 in file hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:255)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:170)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 18 in block 0 in file hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:344)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:122)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:253)
        ... 11 more
Caused by: parquet.io.ParquetDecodingException: Can not read value at 18 in block 0 in file hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:216)
        at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:144)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:48)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:339)
        ... 15 more
Caused by: parquet.io.ParquetDecodingException: Can't read value in column [sessionid] BINARY at value 18 out of 18, 18 out of 18 in currentPage. repetition level: 0, definition level: 1
        at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:450)
        at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:352)
        at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:197)
        ... 19 more
Caused by: parquet.io.ParquetDecodingException: could not read bytes at offset 726
        at parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:43)
        at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:295)
        at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:446)
        ... 22 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 726
        at parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:54)
        at parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:36)
        ... 24 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
What's the reason for this error? Why is the data getting corrupted while reading?
Thanks
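
As background on the innermost frames of the trace (BinaryPlainValuesReader.readBytes calling BytesUtils.readIntLittleEndian): in Parquet's PLAIN encoding, each BINARY value is stored as a 4-byte little-endian length followed by that many bytes. The sketch below is a simplified Python illustration of that layout, not the Parquet library's code, and the function names are made up for the example. It shows how a reader that expects more values or bytes than the page buffer holds ends up indexing past the end of the buffer. It does not say why the buffer in this thread was short, which is what the questions in the reply are meant to narrow down.

```python
import struct


def encode_plain_binary(values):
    """Lay out byte strings as PLAIN BYTE_ARRAY values:
    a 4-byte little-endian length, then the raw bytes."""
    out = bytearray()
    for value in values:
        out += struct.pack("<I", len(value))
        out += value
    return bytes(out)


def decode_plain_binary(page, count):
    """Read `count` values from `page`.

    Raises IndexError when a length prefix or a value body would extend
    past the end of the buffer.
    """
    values = []
    offset = 0
    for i in range(count):
        if offset + 4 > len(page):
            raise IndexError(
                f"could not read length prefix of value {i} at offset {offset}"
            )
        (length,) = struct.unpack_from("<I", page, offset)
        offset += 4
        if offset + length > len(page):
            raise IndexError(
                f"could not read {length} bytes of value {i} at offset {offset}"
            )
        values.append(page[offset:offset + length])
        offset += length
    return values


if __name__ == "__main__":
    page = encode_plain_binary([b"a", b"bb", b"ccc"])
    print(decode_plain_binary(page, 3))
    try:
        # Asking for one more value than the page holds runs off the end.
        decode_plain_binary(page, 4)
    except IndexError as exc:
        print("decode failed:", exc)
```

Under this layout, a mismatch between the value count the reader expects and what the page actually contains, or a page decoded with the wrong encoding, would surface as an out-of-bounds read much like the one reported.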
