Nishanth created ARROW-17453:
--------------------------------
Summary: Arrow-Parquet c++ - Parquet files show inconsistent data
Key: ARROW-17453
URL: https://issues.apache.org/jira/browse/ARROW-17453
Project: Apache Arrow
Issue Type: Bug
Reporter: Nishanth
Attachments: athena_datatypes.gz.parquet
Parquet files generated from Athena using Iceberg show inconsistent data when
read via arrow-parquet c++, with row values missing.
To reproduce, create an Athena Iceberg table backed by Parquet as below.
{code:sql}
create table datatypes_helper (d_int1 int, d_bigint1 bigint, d_float1 float,
d_double1 double, d_decimal38_0_1 decimal(38,0), d_decimal19_10_1
decimal(19,10), d_date1 date, d_timestamp1 timestamp, d_string1 string,
d_binary1 binary, d_decimal1 decimal, d_array1 array<string>, d_map1
map<string, int>)
LOCATION 's3:/'
TBLPROPERTIES (
'table_type'='ICEBERG',
'format'='parquet'
);
insert into datatypes_helper VALUES (11, CAST(11 as bigint),
CAST(12.222222222100001 as real), 12.222222222222223, cast(11.0 as
decimal(38,0)), cast(12.358024679 as decimal(19,10)), date('2022-01-12'),
timestamp '2022-03-31 18:53:43', cast('IX1KNXR6KPF' as varchar),
cast(X'0000011100' as varbinary), cast(11 as decimal(10,0)), array[cast('11' as
varchar), cast('11' as varchar)], map(array[cast('rowNum' as varchar)],
array[CAST(11 AS INT)]));
insert into datatypes_helper VALUES (12, CAST(12 as bigint),
CAST(13.3333333332 as real), 13.333333333333334, cast(12.0 as decimal(38,0)),
cast(13.481481468 as decimal(19,10)), date('2022-01-13'), timestamp '2022-03-31
18:53:43', cast('WDJC5KD74I0B' as varchar), cast(X'000101111100' as varbinary),
cast(12 as decimal(10,0)), array[cast('12' as varchar), cast('12' as varchar)],
map(array[cast('rowNum' as varchar)], array[CAST(12 AS INT)]));
insert into datatypes_helper VALUES (13, CAST(13 as bigint),
CAST(14.4444444443 as real), 14.444444444444445, cast(13.0 as decimal(38,0)),
cast(14.604938257 as decimal(19,10)), date('2022-01-14'), timestamp '2022-03-31
18:53:43', cast('Y8S9T7QOIWPS0' as varchar), cast(X'110011111110' as
varbinary), cast(13 as decimal(10,0)), array[cast('13' as varchar), cast('13'
as varchar)], map(array[cast('rowNum' as varchar)], array[CAST(13 AS INT)]));
insert into datatypes_helper(d_int1) VALUES (null);
create table datatypes (d_int1 int, d_bigint1 bigint, d_float1 float, d_double1
double, d_decimal38_0_1 decimal(38,0), d_decimal19_10_1 decimal(19,10), d_date1
date, d_timestamp1 timestamp, d_string1 string, d_binary1 binary, d_decimal1
decimal, d_array1 array<string>, d_map1 map<string, int>)
LOCATION 's3://'
TBLPROPERTIES (
'table_type'='ICEBERG',
'format'='parquet'
);
insert into datatypes select * from datatypes_helper;
{code}
The statements above insert 4 rows. However, querying the data using pyarrow
returns values for only 2 of the rows in some columns. Other Parquet tools
(e.g. parquet-tools-mr) do not show this behavior.
{code:python}
>>> parquet_file.read_row_group(0)
pyarrow.Table
d_int1: int32
d_bigint1: int64
d_float1: float
d_double1: double
d_decimal38_0_1: decimal128(38, 0)
d_decimal19_10_1: decimal128(19, 10)
d_date1: date32[day]
d_timestamp1: timestamp[us]
d_string1: string
d_binary1: binary
d_decimal1: decimal128(10, 0)
d_array1: list<element: string>
child 0, element: string
d_map1: map<string, int32 ('d_map1')>
child 0, d_map1: struct<key: string not null, value: int32> not null
child 0, key: string not null
child 1, value: int32
----
d_int1: [[12,11,null,null]]
d_bigint1: [[12,11,null,null]]
d_float1: [[13.333333,12.222222,null,null]]
d_double1: [[13.333333333333334,12.222222222222223,null,null]]
d_decimal38_0_1: [[12,11,null,null]]
d_decimal19_10_1: [[13.4814814680,12.3580246790,null,null]]
d_date1: [[2022-01-13,2022-01-12,null,null]]
d_timestamp1: [[2022-03-31 18:53:43.000000,2022-03-31
18:53:43.000000,null,null]]
d_string1: [["WDJC5KD74I0B","IX1KNXR6KPF",null,null]]
d_binary1: [[000101111100,0000011100,null,null]]{code}
The parquet file queried is attached to the JIRA.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)