The first issue looks like the delta byte array problem:

  https://issues.apache.org/jira/browse/PARQUET-246

The second one looks like the write side uses DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY columns, but the read side doesn't expect it. Could you file a bug?
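
For readers unfamiliar with the delta byte array problem, here is a minimal sketch of how a prefix-plus-suffix encoding can produce an ArrayIndexOutOfBoundsException on the first value of a page (the trace below fails at "1 out of 23600 in currentPage"). It assumes the writer kept its "previous value" across a page boundary while the reader started the page with no previous value; that is one plausible reading of the issue, not a confirmed diagnosis. The class DeltaByteArraySketch and its methods are hypothetical and are not parquet-mr code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of a prefix + suffix ("delta byte array") encoding. Hypothetical, not parquet-mr code. */
final class DeltaByteArraySketch {

    /** One encoded value: how many leading bytes to reuse from the previous value, plus the remainder. */
    static final class Entry {
        final int prefixLength;
        final byte[] suffix;

        Entry(int prefixLength, byte[] suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes one page; `previous` is whatever value the writer remembers when the page starts. */
    static List<Entry> encodePage(List<byte[]> values, byte[] previous) {
        List<Entry> out = new ArrayList<>();
        for (byte[] value : values) {
            int max = Math.min(previous.length, value.length);
            int shared = 0;
            while (shared < max && previous[shared] == value[shared]) {
                shared++;
            }
            out.add(new Entry(shared, Arrays.copyOfRange(value, shared, value.length)));
            previous = value;
        }
        return out;
    }

    /** Decodes one page; `previous` is whatever value the reader remembers when the page starts. */
    static List<byte[]> decodePage(List<Entry> page, byte[] previous) {
        List<byte[]> out = new ArrayList<>();
        for (Entry e : page) {
            byte[] value = new byte[e.prefixLength + e.suffix.length];
            // Throws an IndexOutOfBoundsException if `previous` is shorter than the prefix.
            System.arraycopy(previous, 0, value, 0, e.prefixLength);
            System.arraycopy(e.suffix, 0, value, e.prefixLength, e.suffix.length);
            out.add(value);
            previous = value;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] lastOfPage1 = "apple-pie".getBytes(StandardCharsets.UTF_8);
        List<byte[]> page2 = Collections.singletonList("apple-tart".getBytes(StandardCharsets.UTF_8));

        // Assumption: the writer did not reset its state at the page boundary.
        List<Entry> encoded = encodePage(page2, lastOfPage1);

        // A reader that carries the previous value across pages decodes the page.
        System.out.println(new String(decodePage(encoded, lastOfPage1).get(0), StandardCharsets.UTF_8));

        // A reader that starts each page from scratch fails on the first value.
        try {
            decodePage(encoded, new byte[0]);
        } catch (IndexOutOfBoundsException ex) {
            System.out.println("first value of page failed: " + ex.getClass().getSimpleName());
        }
    }
}
```

In this model the file only decodes if writer and reader agree on whether state carries across pages, which would also explain why the failure shows up mid-file rather than at value 1.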

rb

On 06/18/2015 12:50 PM, Nezih Yigitbasi wrote:
Hi all,

I have generated some test data using the method here
<https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java#L68>.
What I notice is that if I use WriterVersion.PARQUET_2_0, the default block
and page sizes, and GZIP compression (test case 1 below), I cannot read the
file with parquet-tools dump (see stack trace below). When I switch to
PARQUET_1_0 (test case 2 below), I can use the dump tool to read the data.
Oddly enough, when I reduce the number of rows to 1K and use the PARQUET_2_0
writer again (test case 3), dump still fails, but with a different exception.

Are these known issues?

Nezih

Test Case 1 [FAILS]

WriterVersion.PARQUET_2_0
default block and page size
GZIP compression
1M rows

Schema:

file schema:   test
--------------------------------------------------------------------------------
binary_field:  REQUIRED BINARY R:0 D:0
int32_field:   REQUIRED INT32 R:0 D:0
int64_field:   REQUIRED INT64 R:0 D:0
boolean_field: REQUIRED BOOLEAN R:0 D:0
float_field:   REQUIRED FLOAT R:0 D:0
double_field:  REQUIRED DOUBLE R:0 D:0
flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
int96_field:   REQUIRED INT96 R:0 D:0

row group 1:   RC:1000000 TS:38744008 OFFSET:4
--------------------------------------------------------------------------------
binary_field:   BINARY GZIP DO:0 FPO:4 SZ:20683253/36526089/1.77 VC:1000000 ENC:DELTA_BYTE_ARRAY
int32_field:    INT32 GZIP DO:0 FPO:20683257 SZ:524/39330/75.06 VC:1000000 ENC:DELTA_BINARY_PACKED
int64_field:    INT64 GZIP DO:0 FPO:20683781 SZ:693/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
boolean_field:  BOOLEAN GZIP DO:0 FPO:20684474 SZ:63/43/0.68 VC:1000000 ENC:RLE
float_field:    FLOAT GZIP DO:0 FPO:20684537 SZ:362/242/0.67 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
double_field:   DOUBLE GZIP DO:0 FPO:20684899 SZ:694/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:20685593 SZ:2118034/2176246/1.03 VC:1000000 ENC:DELTA_BYTE_ARRAY
int96_field:    INT96 GZIP DO:0 FPO:22803627 SZ:1413/1062/0.75 VC:1000000 ENC:PLAIN,RLE_DICTIONARY

parquet-tools dump fails with:

value 377601: R:0 D:0 V:parquet.io.ParquetDecodingException: Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0
     at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
     at parquet.column.impl.ColumnReaderImpl.getBinary(ColumnReaderImpl.java:410)
     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:288)
     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
     at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
     at parquet.tools.Main.main(Main.java:219)
Caused by: java.lang.ArrayIndexOutOfBoundsException
     at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
     at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
     at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
     ... 5 more
Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0

Test Case 2 [SUCCEEDS]

WriterVersion.PARQUET_1_0
default block and page size
GZIP compression
1M rows

Schema:

file schema:   test
--------------------------------------------------------------------------------
binary_field:  REQUIRED BINARY R:0 D:0
int32_field:   REQUIRED INT32 R:0 D:0
int64_field:   REQUIRED INT64 R:0 D:0
boolean_field: REQUIRED BOOLEAN R:0 D:0
float_field:   REQUIRED FLOAT R:0 D:0
double_field:  REQUIRED DOUBLE R:0 D:0
flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
int96_field:   REQUIRED INT96 R:0 D:0

row group 1:   RC:1000000 TS:1070161196 OFFSET:4
--------------------------------------------------------------------------------
binary_field:   BINARY GZIP DO:0 FPO:4 SZ:21862183/40004054/1.83 VC:1000000 ENC:PLAIN,BIT_PACKED
int32_field:    INT32 GZIP DO:0 FPO:21862187 SZ:1383313/4000159/2.89 VC:1000000 ENC:PLAIN,BIT_PACKED
int64_field:    INT64 GZIP DO:0 FPO:23245500 SZ:572/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
boolean_field:  BOOLEAN GZIP DO:0 FPO:23246072 SZ:188/125032/665.06 VC:1000000 ENC:PLAIN,BIT_PACKED
float_field:    FLOAT GZIP DO:0 FPO:23246260 SZ:273/173/0.63 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
double_field:   DOUBLE GZIP DO:0 FPO:23246533 SZ:573/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:23247106 SZ:3057410/1026030079/335.59 VC:1000000 ENC:PLAIN,BIT_PACKED
int96_field:    INT96 GZIP DO:0 FPO:26304516 SZ:1236/905/0.73 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED

Test Case 3 [FAILS]

WriterVersion.PARQUET_2_0
default block and page size
GZIP compression
1K rows

Schema:

file schema:   test
--------------------------------------------------------------------------------
binary_field:  REQUIRED BINARY R:0 D:0
int32_field:   REQUIRED INT32 R:0 D:0
int64_field:   REQUIRED INT64 R:0 D:0
boolean_field: REQUIRED BOOLEAN R:0 D:0
float_field:   REQUIRED FLOAT R:0 D:0
double_field:  REQUIRED DOUBLE R:0 D:0
flba_field:    REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
int96_field:   REQUIRED INT96 R:0 D:0

row group 1:   RC:1000 TS:40502 OFFSET:4
--------------------------------------------------------------------------------
binary_field:   BINARY GZIP DO:0 FPO:4 SZ:21466/36672/1.71 VC:1000 ENC:DELTA_BYTE_ARRAY
int32_field:    INT32 GZIP DO:0 FPO:21470 SZ:70/85/1.21 VC:1000 ENC:DELTA_BINARY_PACKED
int64_field:    INT64 GZIP DO:0 FPO:21540 SZ:106/71/0.67 VC:1000 ENC:RLE_DICTIONARY,PLAIN
boolean_field:  BOOLEAN GZIP DO:0 FPO:21646 SZ:60/40/0.67 VC:1000 ENC:RLE
float_field:    FLOAT GZIP DO:0 FPO:21706 SZ:99/59/0.60 VC:1000 ENC:RLE_DICTIONARY,PLAIN
double_field:   DOUBLE GZIP DO:0 FPO:21805 SZ:107/71/0.66 VC:1000 ENC:RLE_DICTIONARY,PLAIN
flba_field:     FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:21912 SZ:2152/3421/1.59 VC:1000 ENC:DELTA_BYTE_ARRAY
int96_field:    INT96 GZIP DO:0 FPO:24064 SZ:114/83/0.73 VC:1000 ENC:RLE_DICTIONARY,PLAIN

parquet-tools dump fails when dumping the fixed len byte array field:

FIXED_LEN_BYTE_ARRAY flba_field
--------------------------------------------------------------------------------
parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
     at parquet.column.Encoding$7.getValuesReader(Encoding.java:196)
     at parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:537)
     at parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:577)
     at parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:57)
     at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:521)
     at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:513)
     at parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
     at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:513)
     at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:505)
     at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:607)
     at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:351)
     at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:66)
     at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:61)
     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:278)
     at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
     at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
     at parquet.tools.Main.main(Main.java:219)
Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
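
The mismatch behind this exception can be modelled in a few lines: the column metadata above shows flba_field written with ENC:DELTA_BYTE_ARRAY, while the reader accepts that encoding only for BINARY. The sketch below is a toy model of that disagreement under those two observations. The class EncodingTypeCheckSketch, its enum, and both methods are hypothetical names, and IllegalStateException stands in for ParquetDecodingException; none of it is the actual parquet-mr code.

```java
/** Toy model of a writer/reader disagreement on which types an encoding supports. Hypothetical. */
final class EncodingTypeCheckSketch {

    enum TypeName { BINARY, FIXED_LEN_BYTE_ARRAY, INT32 }

    /** Writer side, as suggested by the dump output: both byte-array types get DELTA_BYTE_ARRAY. */
    static String chooseWriterEncoding(TypeName type) {
        switch (type) {
            case BINARY:
            case FIXED_LEN_BYTE_ARRAY:
                return "DELTA_BYTE_ARRAY";
            default:
                return "DELTA_BINARY_PACKED";
        }
    }

    /** Reader side, as suggested by the exception message: only BINARY is accepted. */
    static void checkReadable(String encoding, TypeName type) {
        if (encoding.equals("DELTA_BYTE_ARRAY") && type != TypeName.BINARY) {
            // Stand-in for ParquetDecodingException.
            throw new IllegalStateException(
                "Encoding DELTA_BYTE_ARRAY is only supported for type " + TypeName.BINARY);
        }
    }

    public static void main(String[] args) {
        for (TypeName type : TypeName.values()) {
            String encoding = chooseWriterEncoding(type);
            try {
                checkReadable(encoding, type);
                System.out.println(type + " with " + encoding + ": readable");
            } catch (IllegalStateException ex) {
                System.out.println(type + " with " + encoding + ": " + ex.getMessage());
            }
        }
    }
}
```

Either side could be changed to make the two agree; which one is correct depends on what the format specification allows for FIXED_LEN_BYTE_ARRAY.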


--
Ryan Blue
Software Engineer
Cloudera, Inc.
