Yep, I will; seemed like a bug to me too.

Thanks,
Nezih

On Thu, Jun 18, 2015 at 1:33 PM, Ryan Blue <[email protected]> wrote:

> The first issue looks like the delta byte array problem:
>
> https://issues.apache.org/jira/browse/PARQUET-246
>
> The second one looks like the write side uses delta_byte_array for fixed,
> but the read side doesn't expect it. File a bug?
>
> rb
>
> On 06/18/2015 12:50 PM, Nezih Yigitbasi wrote:
>
>> Hi all,
>>
>> I have generated some test data using the method here
>> <https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java#L68>.
>>
>> What I notice is if I use WriterVersion.PARQUET_2_0, the default block and
>> page sizes, and GZIP compression (test case 1 below) I cannot read the file
>> with parquet-tools dump (see stack trace below). When I switch to
>> PARQUET_1_0 (test case 2 below) I can use dump tool to read the data. Weird
>> enough when I reduce the number of rows I create to 1K and use PARQUET_2_0
>> writer again (test case 3) dump still fails but with a different exception.
>>
>> Are these known issues?
>>
>> Nezih
>>
>> Test Case 1 [FAILS]
>>
>> WriterVersion.PARQUET_2_0
>> default block and page size
>> GZIP compression
>> 1M rows
>>
>> Schema:
>>
>> file schema: test
>> --------------------------------------------------------------------------------
>> binary_field: REQUIRED BINARY R:0 D:0
>> int32_field: REQUIRED INT32 R:0 D:0
>> int64_field: REQUIRED INT64 R:0 D:0
>> boolean_field: REQUIRED BOOLEAN R:0 D:0
>> float_field: REQUIRED FLOAT R:0 D:0
>> double_field: REQUIRED DOUBLE R:0 D:0
>> flba_field: REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
>> int96_field: REQUIRED INT96 R:0 D:0
>>
>> row group 1: RC:1000000 TS:38744008 OFFSET:4
>> --------------------------------------------------------------------------------
>> binary_field: BINARY GZIP DO:0 FPO:4 SZ:20683253/36526089/1.77 VC:1000000 ENC:DELTA_BYTE_ARRAY
>> int32_field: INT32 GZIP DO:0 FPO:20683257 SZ:524/39330/75.06 VC:1000000 ENC:DELTA_BINARY_PACKED
>> int64_field: INT64 GZIP DO:0 FPO:20683781 SZ:693/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
>> boolean_field: BOOLEAN GZIP DO:0 FPO:20684474 SZ:63/43/0.68 VC:1000000 ENC:RLE
>> float_field: FLOAT GZIP DO:0 FPO:20684537 SZ:362/242/0.67 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
>> double_field: DOUBLE GZIP DO:0 FPO:20684899 SZ:694/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
>> flba_field: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:20685593 SZ:2118034/2176246/1.03 VC:1000000 ENC:DELTA_BYTE_ARRAY
>> int96_field: INT96 GZIP DO:0 FPO:22803627 SZ:1413/1062/0.75 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
>>
>> parquet-tools dump fails with:
>>
>> value 377601: R:0 D:0 V:parquet.io.ParquetDecodingException: Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0
>>     at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
>>     at parquet.column.impl.ColumnReaderImpl.getBinary(ColumnReaderImpl.java:410)
>>     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:288)
>>     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
>>     at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
>>     at parquet.tools.Main.main(Main.java:219)
>> Caused by: java.lang.ArrayIndexOutOfBoundsException
>>     at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
>>     at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
>>     at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
>>     ... 5 more
>> Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0
>>
>> Test Case 2 [SUCCEEDS]
>>
>> WriterVersion.PARQUET_1_0
>> default block and page size
>> GZIP compression
>> 1M rows
>>
>> Schema:
>>
>> file schema: test
>> --------------------------------------------------------------------------------
>> binary_field: REQUIRED BINARY R:0 D:0
>> int32_field: REQUIRED INT32 R:0 D:0
>> int64_field: REQUIRED INT64 R:0 D:0
>> boolean_field: REQUIRED BOOLEAN R:0 D:0
>> float_field: REQUIRED FLOAT R:0 D:0
>> double_field: REQUIRED DOUBLE R:0 D:0
>> flba_field: REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
>> int96_field: REQUIRED INT96 R:0 D:0
>>
>> row group 1: RC:1000000 TS:1070161196 OFFSET:4
>> --------------------------------------------------------------------------------
>> binary_field: BINARY GZIP DO:0 FPO:4 SZ:21862183/40004054/1.83 VC:1000000 ENC:PLAIN,BIT_PACKED
>> int32_field: INT32 GZIP DO:0 FPO:21862187 SZ:1383313/4000159/2.89 VC:1000000 ENC:PLAIN,BIT_PACKED
>> int64_field: INT64 GZIP DO:0 FPO:23245500 SZ:572/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
>> boolean_field: BOOLEAN GZIP DO:0 FPO:23246072 SZ:188/125032/665.06 VC:1000000 ENC:PLAIN,BIT_PACKED
>> float_field: FLOAT GZIP DO:0 FPO:23246260 SZ:273/173/0.63 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
>> double_field: DOUBLE GZIP DO:0 FPO:23246533 SZ:573/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
>> flba_field: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:23247106 SZ:3057410/1026030079/335.59 VC:1000000 ENC:PLAIN,BIT_PACKED
>> int96_field: INT96 GZIP DO:0 FPO:26304516 SZ:1236/905/0.73 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
>>
>> Test Case 3 [FAILS]
>>
>> WriterVersion.PARQUET_2_0
>> default block and page size
>> GZIP compression
>> 1K rows
>>
>> Schema:
>>
>> file schema: test
>> --------------------------------------------------------------------------------
>> binary_field: REQUIRED BINARY R:0 D:0
>> int32_field: REQUIRED INT32 R:0 D:0
>> int64_field: REQUIRED INT64 R:0 D:0
>> boolean_field: REQUIRED BOOLEAN R:0 D:0
>> float_field: REQUIRED FLOAT R:0 D:0
>> double_field: REQUIRED DOUBLE R:0 D:0
>> flba_field: REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
>> int96_field: REQUIRED INT96 R:0 D:0
>>
>> row group 1: RC:1000 TS:40502 OFFSET:4
>> --------------------------------------------------------------------------------
>> binary_field: BINARY GZIP DO:0 FPO:4 SZ:21466/36672/1.71 VC:1000 ENC:DELTA_BYTE_ARRAY
>> int32_field: INT32 GZIP DO:0 FPO:21470 SZ:70/85/1.21 VC:1000 ENC:DELTA_BINARY_PACKED
>> int64_field: INT64 GZIP DO:0 FPO:21540 SZ:106/71/0.67 VC:1000 ENC:RLE_DICTIONARY,PLAIN
>> boolean_field: BOOLEAN GZIP DO:0 FPO:21646 SZ:60/40/0.67 VC:1000 ENC:RLE
>> float_field: FLOAT GZIP DO:0 FPO:21706 SZ:99/59/0.60 VC:1000 ENC:RLE_DICTIONARY,PLAIN
>> double_field: DOUBLE GZIP DO:0 FPO:21805 SZ:107/71/0.66 VC:1000 ENC:RLE_DICTIONARY,PLAIN
>> flba_field: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:21912 SZ:2152/3421/1.59 VC:1000 ENC:DELTA_BYTE_ARRAY
>> int96_field: INT96 GZIP DO:0 FPO:24064 SZ:114/83/0.73 VC:1000 ENC:RLE_DICTIONARY,PLAIN
>>
>> parquet-tools dump fails when dumping the fixed len byte array field:
>>
>> FIXED_LEN_BYTE_ARRAY flba_field
>> --------------------------------------------------------------------------------
>> parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
>>     at parquet.column.Encoding$7.getValuesReader(Encoding.java:196)
>>     at parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:537)
>>     at parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:577)
>>     at parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:57)
>>     at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:521)
>>     at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:513)
>>     at parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
>>     at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:513)
>>     at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:505)
>>     at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:607)
>>     at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:351)
>>     at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:66)
>>     at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:61)
>>     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:278)
>>     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
>>     at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
>>     at parquet.tools.Main.main(Main.java:219)
>> Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
