cjrh commented on issue #30302: URL: https://github.com/apache/arrow/issues/30302#issuecomment-2889249559
@lesterfan I have a directory of parquet files. For a specific categorical column, some parquet files use int8 dictionary indices and some use int16. In pyarrow 19.0.1, reading the directory as a dataset succeeds, but with pyarrow 20 it fails with the error shown below.

Reading code:

```python
import pandas as pd

df = pd.read_parquet(
    path,
    engine="pyarrow",
)
```

Traceback:

```
...
  File "/app/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1475, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 589, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3941, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 731 not in range: -128 to 127
```

This shows the dictionary type of the column in each parquet file in the directory:

```
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> dset = ds.dataset(path)
>>> for f in dset.files:
...     sch = pq.read_schema(f)
...     print(f, sch.field('ExpStartDate').type)
...
dataframes.parq/00eac90ef2f504223a74498405e060a48.parquet dictionary<values=string, indices=int8, ordered=0>
dataframes.parq/0641c30f725cd448bafc335d36cd01f6b.parquet dictionary<values=string, indices=int16, ordered=0>
dataframes.parq/0cb2799478dd54c738efe76fdc1875326.parquet dictionary<values=string, indices=int8, ordered=0>
dataframes.parq/0cff477be69be4ee093d98728d4f84452.parquet dictionary<values=string, indices=int16, ordered=0>
dataframes.parq/0d103de6323904e93aecf24589c12a370.parquet dictionary<values=string, indices=int8, ordered=0>
```

Is my issue related to this change? Is there a way to restore the previous behaviour of upcasting to int32 on read?