Joris Van den Bossche created ARROW-9146:
--------------------------------------------

             Summary: [C++][Dataset] Scanning a Fragment with a filter + mismatching schema shouldn't abort
                 Key: ARROW-9146
                 URL: https://issues.apache.org/jira/browse/ARROW-9146
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche


If you filter on a column whose physical type differs from the type in the dataset schema, scanning aborts: the dataset schema, if specified, is currently used for implicit casts, so the filter expression can end up with a different type than the actual physical column.

Small parquet file with one int32 column:
{code:python}
import numpy as np
import pandas as pd
import pyarrow.dataset as ds

# Small parquet file with one int32 column
df = pd.DataFrame({"col": np.array([1, 2, 3, 4], dtype='int32')})
df.to_parquet("test_filter_schema.parquet", engine="pyarrow")

dataset = ds.dataset("test_filter_schema.parquet", format="parquet")
fragment = list(dataset.get_fragments())[0]
{code}

Then read the fragment with a filter on that column, first without and then with a dataset/read schema specified:

{code}
In [48]: fragment.to_table(filter=ds.field("col") > 2).to_pandas() 
Out[48]: 
   col
0    3
1    4

In [49]: fragment.to_table(filter=ds.field("col") > 2, 
schema=pa.schema([("col", pa.int64())])).to_pandas()                            
                                                                            
../src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot 
compare scalars of differing type: int64 vs int32
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.100(+0xee2f86)[0x7f6b56490f86]
...
Aborted (core dumped)
{code}

Now, this int32->int64 type change is something we don't support yet (in the schema evolution/normalization when scanning a dataset), but instead of aborting, this should raise a normal error about the type mismatch.
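Until the type mismatch is handled, a possible workaround (a sketch only, using the current pyarrow API, not part of the proposed fix) is to scan the fragment with its own physical schema and cast the resulting table afterwards, so the filter is evaluated against the physical int32 type:

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

# Same setup as above: small parquet file with one int32 column
df = pd.DataFrame({"col": np.array([1, 2, 3, 4], dtype='int32')})
df.to_parquet("test_filter_schema.parquet", engine="pyarrow")

dataset = ds.dataset("test_filter_schema.parquet", format="parquet")
fragment = list(dataset.get_fragments())[0]

# Filter against the fragment's physical schema (no schema= override),
# then cast the result to the desired int64 schema afterwards.
table = fragment.to_table(filter=ds.field("col") > 2)
table = table.cast(pa.schema([("col", pa.int64())]))
{code}

This sidesteps the crash because the comparison scalar and the column then have the same type during the scan; the cast only happens on the already-filtered table.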



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
