Yep, I will. It seemed like a bug to me too.

Thanks,
Nezih

On Thu, Jun 18, 2015 at 1:33 PM, Ryan Blue <[email protected]> wrote:

> The first issue looks like the delta byte array problem:
>
>   https://issues.apache.org/jira/browse/PARQUET-246
>
> The second one looks like the write side uses DELTA_BYTE_ARRAY for
> FIXED_LEN_BYTE_ARRAY, but the read side doesn't expect it. Could you file a bug?
>
> rb
>
> On 06/18/2015 12:50 PM, Nezih Yigitbasi wrote:
>
>> Hi all,
>>
>> I have generated some test data using the method here:
>> <https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java#L68>
>>
>> What I notice is that if I use WriterVersion.PARQUET_2_0, the default block
>> and page sizes, and GZIP compression (test case 1 below), I cannot read the
>> file with parquet-tools dump (see the stack trace below). When I switch to
>> PARQUET_1_0 (test case 2 below), I can use the dump tool to read the data.
>> Weirdly enough, when I reduce the number of rows to 1K and use the
>> PARQUET_2_0 writer again (test case 3), dump still fails, but with a
>> different exception.
>>
>> Are these known issues?
>>
>> Nezih
>>
>> Test Case 1 [FAILS]
>>
>> WriterVersion.PARQUET_2_0
>> default block and page size
>> GZIP compression
>> 1M rows
>>
>> Schema:
>>
>> file schema:   test
>>
>> --------------------------------------------------------------------------------
>> binary_field:  REQUIRED BINARY R:0 D:0
>> int32_field:   REQUIRED INT32 R:0 D:0
>> int64_field:   REQUIRED INT64 R:0 D:0
>> boolean_field: REQUIRED BOOLEAN R:0 D:0
>> float_field:   REQUIRED FLOAT R:0 D:0
>> double_field:  REQUIRED DOUBLE R:0 D:0
>> flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
>> int96_field:   REQUIRED INT96 R:0 D:0
>>
>> row group 1:   RC:1000000 TS:38744008 OFFSET:4
>>
>> --------------------------------------------------------------------------------
>> binary_field:   BINARY GZIP DO:0 FPO:4 SZ:20683253/36526089/1.77 VC:1000000 ENC:DELTA_BYTE_ARRAY
>> int32_field:    INT32 GZIP DO:0 FPO:20683257 SZ:524/39330/75.06 VC:1000000 ENC:DELTA_BINARY_PACKED
>> int64_field:    INT64 GZIP DO:0 FPO:20683781 SZ:693/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
>> boolean_field:  BOOLEAN GZIP DO:0 FPO:20684474 SZ:63/43/0.68 VC:1000000 ENC:RLE
>> float_field:    FLOAT GZIP DO:0 FPO:20684537 SZ:362/242/0.67 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
>> double_field:   DOUBLE GZIP DO:0 FPO:20684899 SZ:694/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
>> flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:20685593 SZ:2118034/2176246/1.03 VC:1000000 ENC:DELTA_BYTE_ARRAY
>> int96_field:    INT96 GZIP DO:0 FPO:22803627 SZ:1413/1062/0.75 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
>>
>> parquet-tools dump fails with:
>>
>> value 377601: R:0 D:0 V:parquet.io.ParquetDecodingException: Can't
>> read value in column [binary_field] BINARY at value 377601 out of
>> 1000000, 1 out of 23600 in currentPage. repetition level: 0,
>> definition level: 0
>>      at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
>>      at parquet.column.impl.ColumnReaderImpl.getBinary(ColumnReaderImpl.java:410)
>>      at parquet.tools.command.DumpCommand.dump(DumpCommand.java:288)
>>      at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
>>      at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
>>      at parquet.tools.Main.main(Main.java:219)
>> Caused by: java.lang.ArrayIndexOutOfBoundsException
>>      at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
>>      at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
>>      at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
>>      ... 5 more
>> Can't read value in column [binary_field] BINARY at value 377601 out
>> of 1000000, 1 out of 23600 in currentPage. repetition level: 0,
>> definition level: 0
>>
>> Test Case 2 [SUCCEEDS]
>>
>> WriterVersion.PARQUET_1_0
>> default block and page size
>> GZIP compression
>> 1M rows
>>
>> Schema:
>>
>> file schema:   test
>>
>> --------------------------------------------------------------------------------
>> binary_field:  REQUIRED BINARY R:0 D:0
>> int32_field:   REQUIRED INT32 R:0 D:0
>> int64_field:   REQUIRED INT64 R:0 D:0
>> boolean_field: REQUIRED BOOLEAN R:0 D:0
>> float_field:   REQUIRED FLOAT R:0 D:0
>> double_field:  REQUIRED DOUBLE R:0 D:0
>> flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
>> int96_field:   REQUIRED INT96 R:0 D:0
>>
>> row group 1:   RC:1000000 TS:1070161196 OFFSET:4
>>
>> --------------------------------------------------------------------------------
>> binary_field:   BINARY GZIP DO:0 FPO:4 SZ:21862183/40004054/1.83 VC:1000000 ENC:PLAIN,BIT_PACKED
>> int32_field:    INT32 GZIP DO:0 FPO:21862187 SZ:1383313/4000159/2.89 VC:1000000 ENC:PLAIN,BIT_PACKED
>> int64_field:    INT64 GZIP DO:0 FPO:23245500 SZ:572/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
>> boolean_field:  BOOLEAN GZIP DO:0 FPO:23246072 SZ:188/125032/665.06 VC:1000000 ENC:PLAIN,BIT_PACKED
>> float_field:    FLOAT GZIP DO:0 FPO:23246260 SZ:273/173/0.63 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
>> double_field:   DOUBLE GZIP DO:0 FPO:23246533 SZ:573/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
>> flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:23247106 SZ:3057410/1026030079/335.59 VC:1000000 ENC:PLAIN,BIT_PACKED
>> int96_field:    INT96 GZIP DO:0 FPO:26304516 SZ:1236/905/0.73 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
>>
>> Test Case 3 [FAILS]
>>
>> WriterVersion.PARQUET_2_0
>> default block and page size
>> GZIP compression
>> 1K rows
>>
>> Schema:
>>
>> file schema:   test
>>
>> --------------------------------------------------------------------------------
>> binary_field:  REQUIRED BINARY R:0 D:0
>> int32_field:   REQUIRED INT32 R:0 D:0
>> int64_field:   REQUIRED INT64 R:0 D:0
>> boolean_field: REQUIRED BOOLEAN R:0 D:0
>> float_field:   REQUIRED FLOAT R:0 D:0
>> double_field:  REQUIRED DOUBLE R:0 D:0
>> flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
>> int96_field:   REQUIRED INT96 R:0 D:0
>>
>> row group 1:   RC:1000 TS:40502 OFFSET:4
>>
>> --------------------------------------------------------------------------------
>> binary_field:   BINARY GZIP DO:0 FPO:4 SZ:21466/36672/1.71 VC:1000 ENC:DELTA_BYTE_ARRAY
>> int32_field:    INT32 GZIP DO:0 FPO:21470 SZ:70/85/1.21 VC:1000 ENC:DELTA_BINARY_PACKED
>> int64_field:    INT64 GZIP DO:0 FPO:21540 SZ:106/71/0.67 VC:1000 ENC:RLE_DICTIONARY,PLAIN
>> boolean_field:  BOOLEAN GZIP DO:0 FPO:21646 SZ:60/40/0.67 VC:1000 ENC:RLE
>> float_field:    FLOAT GZIP DO:0 FPO:21706 SZ:99/59/0.60 VC:1000 ENC:RLE_DICTIONARY,PLAIN
>> double_field:   DOUBLE GZIP DO:0 FPO:21805 SZ:107/71/0.66 VC:1000 ENC:RLE_DICTIONARY,PLAIN
>> flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:21912 SZ:2152/3421/1.59 VC:1000 ENC:DELTA_BYTE_ARRAY
>> int96_field:    INT96 GZIP DO:0 FPO:24064 SZ:114/83/0.73 VC:1000 ENC:RLE_DICTIONARY,PLAIN
>>
>> parquet-tools dump fails when dumping the fixed-length byte array field:
>>
>> FIXED_LEN_BYTE_ARRAY flba_field
>>
>> --------------------------------------------------------------------------------
>> parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is only
>> supported for type BINARY
>>      at parquet.column.Encoding$7.getValuesReader(Encoding.java:196)
>>      at parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:537)
>>      at parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:577)
>>      at parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:57)
>>      at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:521)
>>      at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:513)
>>      at parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
>>      at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:513)
>>      at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:505)
>>      at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:607)
>>      at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:351)
>>      at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:66)
>>      at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:61)
>>      at parquet.tools.command.DumpCommand.dump(DumpCommand.java:278)
>>      at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
>>      at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
>>      at parquet.tools.Main.main(Main.java:219)
>> Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
