erratic-pattern opened a new issue, #20937:
URL: https://github.com/apache/datafusion/issues/20937
### Describe the bug
When a `VALUES` clause is joined against a Parquet table with
`Dictionary(Int32, Utf8)` columns and `pushdown_filters = true`, the query
fails with:
```
Parquet error: External: Compute error: Error evaluating filter predicate:
ArrowError(InvalidArgumentError("Can't compare arrays of different types"),
Some(""))
```
The dynamic filter pushdown from the `HashJoinExec` creates an `InListExpr`
that is pushed into the Parquet scan.
[`ArrayStaticFilter::contains()`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L90)
unwraps the needle (the Dictionary array from the Parquet batch) to its plain
Utf8 values via
[`downcast_dictionary_array!`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L102-L108),
but does not unwrap `in_array` (the stored InList values, which are also
Dictionary from the type-coerced join build side). This results in a
[`make_comparator(Utf8, Dictionary)`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L116)
call, which
[`arrow_ord::ord::build_compare`](https://docs.rs/arrow-ord/57.3.0/src/arrow_ord/ord.rs.html#475-490)
does not support for mixed Dictionary/non-Dictionary types.
### To Reproduce
Using the DataFusion CLI (`datafusion-cli`):
```sql
-- Enable row-level filter pushdown (required to trigger the bug)
SET datafusion.execution.parquet.pushdown_filters = true;
SET datafusion.execution.parquet.reorder_filters = true;
-- Create a Parquet file with Dictionary-encoded string columns
COPY (
  SELECT
    arrow_cast(chr(65 + (row_num % 26)), 'Dictionary(Int32, Utf8)') as tag1,
    row_num * 1.0 as value
  FROM (SELECT unnest(range(0, 10000)) as row_num)
) TO '/tmp/dict_filter_bug.parquet';
-- This query fails
SELECT t.tag1, t.value
FROM '/tmp/dict_filter_bug.parquet' t
JOIN (VALUES ('A'), ('B')) AS v(c1)
ON t.tag1 = v.c1;
```
### Expected behavior
The query should return matching rows (tag1 = 'A' or 'B').
### Actual behavior
```
Parquet error: External: Compute error: Error evaluating filter predicate:
ArrowError(InvalidArgumentError("Can't compare arrays of different types"),
Some(""))
```
### Root cause
The error path through the code:
1. The VALUES clause produces `Utf8` strings. Type coercion casts them to
`Dictionary(Int32, Utf8)` to match the Parquet column type for the join key.
2. The `HashJoinExec` builds the dynamic filter InList from the cast
build-side arrays → `in_array` is `Dictionary(Int32, Utf8)`.
3. [`instantiate_static_filter()`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L157-L159)
receives the Dictionary array and falls through to
`ArrayStaticFilter::try_new()`, which stores `in_array` as-is.
4. [`InListExpr::try_new_from_array()`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L647)
is called from
[`create_membership_predicate()`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs#L124-L128)
in the dynamic filter pushdown code.
5. At runtime, `ArrayStaticFilter::contains(v)` is called with a Dictionary
array from the Parquet batch.
6. [`downcast_dictionary_array!`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L102-L108)
matches and recursively calls `self.contains(v.values())` with the unwrapped
Utf8 values.
7. In the recursive call, `v` is now `Utf8` but `self.in_array` is still
`Dictionary(Int32, Utf8)`.
8. [`make_comparator(Utf8, Dictionary)`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L116)
fails because `arrow_ord::ord::build_compare` has no arm for mixed
Dictionary/non-Dictionary types.
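
To make steps 5 to 8 easier to follow, here is a rough toy model of the asymmetry in plain Python. It is not DataFusion or Arrow code: `Utf8Array`, `DictArray`, `compare_membership`, and `contains` are hypothetical stand-ins invented for illustration, and the comparator is assumed to reject any mixed Dictionary/plain pair the way the error message suggests.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Utf8Array:
    """Plain string array (stand-in for an Arrow Utf8 array)."""
    values: List[str]


@dataclass
class DictArray:
    """Dictionary-encoded array: keys index into a plain values array."""
    keys: List[int]
    values: Utf8Array


Array = Union[Utf8Array, DictArray]


def compare_membership(needle: Array, haystack: Array) -> List[bool]:
    """Stand-in for the comparator step: refuses mixed encodings."""
    if type(needle) is not type(haystack):
        raise TypeError("Can't compare arrays of different types")
    members = set(haystack.values)
    return [v in members for v in needle.values]


def contains(needle: Array, in_array: Array) -> List[bool]:
    """Mimics the shape of the failing path: only the needle is unwrapped."""
    if isinstance(needle, DictArray):
        # Evaluate on the dictionary's values, then map results back via keys.
        hits = contains(needle.values, in_array)
        return [hits[k] for k in needle.keys]
    # in_array is used as stored; if it is still a DictArray this raises.
    return compare_membership(needle, in_array)


if __name__ == "__main__":
    probe = DictArray(keys=[0, 1, 2, 0], values=Utf8Array(["A", "B", "C"]))
    in_list = DictArray(keys=[0, 1], values=Utf8Array(["A", "B"]))
    try:
        contains(probe, in_list)
    except TypeError as err:
        print("failed:", err)
```

In this model the recursive call reaches the comparator with a plain needle and a Dictionary `in_array`, which is the same mismatch described in steps 7 and 8.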
### Possible fix
The most straightforward fix would be to normalize `in_array` in
[`ArrayStaticFilter::try_new()`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L187)
— if the input is a Dictionary, unwrap it to its values array before storing.
This is a one-time cost at filter construction, and ensures that after
`contains()` unwraps the needle Dictionary, both sides are the same plain type
and the existing arrow comparison kernels work correctly.
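
A minimal sketch of that idea, again as a plain-Python toy model rather than actual DataFusion code. `StaticFilter`, `unwrap_dictionary`, and the two array classes are hypothetical names used only for illustration. One deliberate difference from the wording above: the sketch materializes the Dictionary through its keys instead of taking the values array directly, so dictionary entries that no key references are not treated as members. Whether the real fix needs that distinction is an assumption on my part, not something verified against the codebase.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Utf8Array:
    """Plain string array (stand-in for an Arrow Utf8 array)."""
    values: List[str]


@dataclass
class DictArray:
    """Dictionary-encoded array: keys index into a plain values array."""
    keys: List[int]
    values: Utf8Array


Array = Union[Utf8Array, DictArray]


def unwrap_dictionary(array: Array) -> Utf8Array:
    """Return a plain array; Dictionary input is materialized via its keys."""
    if isinstance(array, DictArray):
        return Utf8Array([array.values.values[k] for k in array.keys])
    return array


class StaticFilter:
    """Toy analogue of a static IN-list filter with a normalized haystack."""

    def __init__(self, in_array: Array) -> None:
        # One-time normalization at construction: never store a DictArray.
        self.in_array = unwrap_dictionary(in_array)
        self._members = set(self.in_array.values)

    def contains(self, needle: Array) -> List[bool]:
        if isinstance(needle, DictArray):
            # Check the dictionary's values once, then map back via keys.
            hits = self.contains(needle.values)
            return [hits[k] for k in needle.keys]
        # Both sides are plain here, so the comparison is well-typed.
        return [v in self._members for v in needle.values]


if __name__ == "__main__":
    filt = StaticFilter(DictArray(keys=[0, 1], values=Utf8Array(["A", "B"])))
    batch = DictArray(keys=[0, 1, 2, 0], values=Utf8Array(["A", "B", "C"]))
    print(filt.contains(batch))
```

The normalization cost is paid once when the filter is built, and `contains` itself does not need to change.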
### Additional context
- This only manifests when `pushdown_filters = true` (defaults to `false`),
because the row-level filter evaluation inside the Parquet reader is what
triggers `ArrayStaticFilter::contains()` with Dictionary arrays. With
`pushdown_filters = false`, the InList filter is only used for statistics/bloom
filter pruning (which works correctly), and the actual row filtering happens in
the HashJoin.
- The bug applies to single-column equi-joins where the VALUES side is cast
to Dictionary by type coercion and the dynamic filter InList path is taken
(small build side).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]