Max Firman created ARROW-7350:
---------------------------------
Summary: [Python] Parquet file metadata min and max statistics not
decoded from bytes for Decimal data types
Key: ARROW-7350
URL: https://issues.apache.org/jira/browse/ARROW-7350
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Reporter: Max Firman
Parquet file metadata for Decimal type columns contains min and max statistics
that are returned as raw bytes instead of being decoded into Decimals. This
causes issues in dependent libraries such as Dask (see
[https://github.com/dask/dask/issues/5647]).
{code:python|title=Reproducible example|borderStyle=solid}
from decimal import Decimal
import random

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

NUM_DATA_POINTS_PER_PARTITION = 25
random.seed(0)

# Build a single Decimal column of random values.
data1 = [
    {"col1": Decimal(f"{random.randint(0, 999)}.{random.randint(0, 99)}")}
    for _ in range(NUM_DATA_POINTS_PER_PARTITION)
]
df = pd.DataFrame(data1)
table = pa.Table.from_pandas(df)
pq.write_table(table, 'my_data.parquet')

# Read the column statistics back from the file metadata.
parquet_file = pq.ParquetFile('my_data.parquet')
statistics = parquet_file.metadata.row_group(0).column(0).statistics

# AssertionError here because min has type bytes rather than Decimal.
assert isinstance(statistics.min, Decimal)
assert isinstance(statistics.max, Decimal)
{code}
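Until the statistics are decoded by the library, a caller could decode the raw bytes by hand. The sketch below is only an illustration, not the eventual fix: it assumes the bytes follow the Parquet format's DECIMAL encoding for byte-array storage (a big-endian two's-complement unscaled integer) and that the caller supplies the column's scale from the schema. The helper name {{decode_parquet_decimal}} is hypothetical and not part of pyarrow.

```python
from decimal import Decimal


def decode_parquet_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a Parquet DECIMAL statistic from its raw bytes.

    Assumption: `raw` is a big-endian two's-complement unscaled integer and
    `scale` is the decimal scale declared for the column in the schema.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a (sign, digits, exponent) tuple is exact and is not
    # affected by the current decimal context precision.
    return Decimal((sign, digits, -scale))


# 0x3039 is 12345; with scale 2 this represents 123.45.
print(decode_parquet_decimal(b"\x30\x39", 2))
```

The tuple constructor is used rather than arithmetic on Decimal values so that values with up to 38 digits of precision are not rounded by the default context.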