Joris Van den Bossche created ARROW-9146:
--------------------------------------------
Summary: [C++][Dataset] Scanning a Fragment with a filter +
mismatching schema shouldn't abort
Key: ARROW-9146
URL: https://issues.apache.org/jira/browse/ARROW-9146
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche
If you have a filter on a column where the physical schema and the dataset schema differ, scanning aborts (right now the dataset schema, if specified, gets used for implicit casts, but the expression might then have a different type than the actual physical column):
Small parquet file with one int32 column:
{code:python}
import numpy as np
import pandas as pd
import pyarrow.dataset as ds

df = pd.DataFrame({"col": np.array([1, 2, 3, 4], dtype='int32')})
df.to_parquet("test_filter_schema.parquet", engine="pyarrow")

dataset = ds.dataset("test_filter_schema.parquet", format="parquet")
fragment = list(dataset.get_fragments())[0]
{code}
and then reading the fragment with a filter on that column, both without and with specifying a dataset/read schema:
{code}
In [48]: fragment.to_table(filter=ds.field("col") > 2).to_pandas()
Out[48]:
col
0 3
1 4
In [49]: fragment.to_table(filter=ds.field("col") > 2,
schema=pa.schema([("col", pa.int64())])).to_pandas()
../src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot
compare scalars of differing type: int64 vs int32
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.100(+0xee2f86)[0x7f6b56490f86]
...
Aborted (core dumped)
{code}
Now, this int32 -> int64 type change is something we don't support yet (in the schema evolution/normalization when scanning a dataset), but it shouldn't abort either; it should raise a normal error about the type mismatch.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)