Hi all,
I have generated some test data using the method here
<https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java#L68>.
What I notice is that if I use WriterVersion.PARQUET_2_0 with the default
block and page sizes and GZIP compression (test case 1 below), I cannot read
the file with parquet-tools dump (see the stack trace below). When I switch to
PARQUET_1_0 (test case 2 below), the dump tool reads the data without
problems. Oddly enough, when I reduce the number of rows to 1K and use the
PARQUET_2_0 writer again (test case 3), dump still fails, but with a
different exception.
Are these known issues?
Nezih
Test Case 1 [FAILS]
WriterVersion.PARQUET_2_0
default block and page size
GZIP compression
1M rows
Schema:
file schema: test
--------------------------------------------------------------------------------
binary_field: REQUIRED BINARY R:0 D:0
int32_field: REQUIRED INT32 R:0 D:0
int64_field: REQUIRED INT64 R:0 D:0
boolean_field: REQUIRED BOOLEAN R:0 D:0
float_field: REQUIRED FLOAT R:0 D:0
double_field: REQUIRED DOUBLE R:0 D:0
flba_field: REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
int96_field: REQUIRED INT96 R:0 D:0
row group 1: RC:1000000 TS:38744008 OFFSET:4
--------------------------------------------------------------------------------
binary_field: BINARY GZIP DO:0 FPO:4 SZ:20683253/36526089/1.77 VC:1000000 ENC:DELTA_BYTE_ARRAY
int32_field: INT32 GZIP DO:0 FPO:20683257 SZ:524/39330/75.06 VC:1000000 ENC:DELTA_BINARY_PACKED
int64_field: INT64 GZIP DO:0 FPO:20683781 SZ:693/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
boolean_field: BOOLEAN GZIP DO:0 FPO:20684474 SZ:63/43/0.68 VC:1000000 ENC:RLE
float_field: FLOAT GZIP DO:0 FPO:20684537 SZ:362/242/0.67 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
double_field: DOUBLE GZIP DO:0 FPO:20684899 SZ:694/498/0.72 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
flba_field: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:20685593 SZ:2118034/2176246/1.03 VC:1000000 ENC:DELTA_BYTE_ARRAY
int96_field: INT96 GZIP DO:0 FPO:22803627 SZ:1413/1062/0.75 VC:1000000 ENC:PLAIN,RLE_DICTIONARY
parquet-tools dump fails with:
value 377601: R:0 D:0 V:parquet.io.ParquetDecodingException: Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0
at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
at parquet.column.impl.ColumnReaderImpl.getBinary(ColumnReaderImpl.java:410)
at parquet.tools.command.DumpCommand.dump(DumpCommand.java:288)
at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
at parquet.tools.Main.main(Main.java:219)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at parquet.column.values.deltastrings.DeltaByteArrayReader.readBytes(DeltaByteArrayReader.java:70)
at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:307)
at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:458)
... 5 more
Can't read value in column [binary_field] BINARY at value 377601 out of 1000000, 1 out of 23600 in currentPage. repetition level: 0, definition level: 0
Test Case 2 [SUCCEEDS]
WriterVersion.PARQUET_1_0
default block and page size
GZIP compression
1M rows
Schema:
file schema: test
--------------------------------------------------------------------------------
binary_field: REQUIRED BINARY R:0 D:0
int32_field: REQUIRED INT32 R:0 D:0
int64_field: REQUIRED INT64 R:0 D:0
boolean_field: REQUIRED BOOLEAN R:0 D:0
float_field: REQUIRED FLOAT R:0 D:0
double_field: REQUIRED DOUBLE R:0 D:0
flba_field: REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
int96_field: REQUIRED INT96 R:0 D:0
row group 1: RC:1000000 TS:1070161196 OFFSET:4
--------------------------------------------------------------------------------
binary_field: BINARY GZIP DO:0 FPO:4 SZ:21862183/40004054/1.83 VC:1000000 ENC:PLAIN,BIT_PACKED
int32_field: INT32 GZIP DO:0 FPO:21862187 SZ:1383313/4000159/2.89 VC:1000000 ENC:PLAIN,BIT_PACKED
int64_field: INT64 GZIP DO:0 FPO:23245500 SZ:572/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
boolean_field: BOOLEAN GZIP DO:0 FPO:23246072 SZ:188/125032/665.06 VC:1000000 ENC:PLAIN,BIT_PACKED
float_field: FLOAT GZIP DO:0 FPO:23246260 SZ:273/173/0.63 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
double_field: DOUBLE GZIP DO:0 FPO:23246533 SZ:573/397/0.69 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
flba_field: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:23247106 SZ:3057410/1026030079/335.59 VC:1000000 ENC:PLAIN,BIT_PACKED
int96_field: INT96 GZIP DO:0 FPO:26304516 SZ:1236/905/0.73 VC:1000000 ENC:PLAIN_DICTIONARY,BIT_PACKED
Test Case 3 [FAILS]
WriterVersion.PARQUET_2_0
default block and page size
GZIP compression
1K rows
Schema:
file schema: test
--------------------------------------------------------------------------------
binary_field: REQUIRED BINARY R:0 D:0
int32_field: REQUIRED INT32 R:0 D:0
int64_field: REQUIRED INT64 R:0 D:0
boolean_field: REQUIRED BOOLEAN R:0 D:0
float_field: REQUIRED FLOAT R:0 D:0
double_field: REQUIRED DOUBLE R:0 D:0
flba_field: REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0
int96_field: REQUIRED INT96 R:0 D:0
row group 1: RC:1000 TS:40502 OFFSET:4
--------------------------------------------------------------------------------
binary_field: BINARY GZIP DO:0 FPO:4 SZ:21466/36672/1.71 VC:1000 ENC:DELTA_BYTE_ARRAY
int32_field: INT32 GZIP DO:0 FPO:21470 SZ:70/85/1.21 VC:1000 ENC:DELTA_BINARY_PACKED
int64_field: INT64 GZIP DO:0 FPO:21540 SZ:106/71/0.67 VC:1000 ENC:RLE_DICTIONARY,PLAIN
boolean_field: BOOLEAN GZIP DO:0 FPO:21646 SZ:60/40/0.67 VC:1000 ENC:RLE
float_field: FLOAT GZIP DO:0 FPO:21706 SZ:99/59/0.60 VC:1000 ENC:RLE_DICTIONARY,PLAIN
double_field: DOUBLE GZIP DO:0 FPO:21805 SZ:107/71/0.66 VC:1000 ENC:RLE_DICTIONARY,PLAIN
flba_field: FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:21912 SZ:2152/3421/1.59 VC:1000 ENC:DELTA_BYTE_ARRAY
int96_field: INT96 GZIP DO:0 FPO:24064 SZ:114/83/0.73 VC:1000 ENC:RLE_DICTIONARY,PLAIN
parquet-tools dump fails when dumping the FIXED_LEN_BYTE_ARRAY field:
FIXED_LEN_BYTE_ARRAY flba_field
--------------------------------------------------------------------------------
parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
at parquet.column.Encoding$7.getValuesReader(Encoding.java:196)
at parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:537)
at parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:577)
at parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:57)
at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:521)
at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:513)
at parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:513)
at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:505)
at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:607)
at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:351)
at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:66)
at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:61)
at parquet.tools.command.DumpCommand.dump(DumpCommand.java:278)
at parquet.tools.command.DumpCommand.dump(DumpCommand.java:215)
at parquet.tools.command.DumpCommand.execute(DumpCommand.java:136)
at parquet.tools.Main.main(Main.java:219)
Encoding DELTA_BYTE_ARRAY is only supported for type BINARY
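For anyone comparing the column chunk sizes across the three cases: the SZ field in the listings above appears to be compressed size / uncompressed size / ratio, with the ratio being uncompressed divided by compressed (for binary_field in test case 1, 36526089 / 20683253 is roughly 1.77). That layout is an assumption inferred from the numbers, not something confirmed from the parquet-tools source. Below is a minimal Java sketch that parses the field and recomputes the ratio; ChunkSize and its methods are hypothetical helpers, not part of parquet-tools.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of parquet-tools.
// Assumes the SZ field is "SZ:<compressed>/<uncompressed>/<ratio>".
class ChunkSize {
    private static final Pattern SZ =
        Pattern.compile("SZ:(\\d+)/(\\d+)/(\\d+\\.\\d+)");

    final long compressed;      // bytes on disk (assumed meaning)
    final long uncompressed;    // bytes before compression (assumed meaning)
    final double reportedRatio; // ratio as printed in the listing

    ChunkSize(long compressed, long uncompressed, double reportedRatio) {
        this.compressed = compressed;
        this.uncompressed = uncompressed;
        this.reportedRatio = reportedRatio;
    }

    // Extracts the SZ field from one column line of the listing.
    static ChunkSize parse(String line) {
        Matcher m = SZ.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("no SZ field in: " + line);
        }
        return new ChunkSize(
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Double.parseDouble(m.group(3)));
    }

    // Recomputes the ratio under the assumed layout.
    double computedRatio() {
        return (double) uncompressed / compressed;
    }

    public static void main(String[] args) {
        // binary_field as reported in test cases 1, 2 and 3.
        String[] lines = {
            "binary_field: BINARY GZIP DO:0 FPO:4 SZ:20683253/36526089/1.77",
            "binary_field: BINARY GZIP DO:0 FPO:4 SZ:21862183/40004054/1.83",
            "binary_field: BINARY GZIP DO:0 FPO:4 SZ:21466/36672/1.71"
        };
        for (String line : lines) {
            ChunkSize s = parse(line);
            System.out.printf(Locale.ROOT,
                "compressed=%d uncompressed=%d reported=%.2f computed=%.2f%n",
                s.compressed, s.uncompressed, s.reportedRatio, s.computedRatio());
        }
    }
}
```

Running the same check over the flba_field lines makes it easier to compare how the PARQUET_1_0 and PARQUET_2_0 writers sized that column.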