alamb commented on code in PR #20962:
URL: https://github.com/apache/datafusion/pull/20962#discussion_r2942147087


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -99,11 +99,16 @@ impl StaticFilter for ArrayStaticFilter {
             ));
         }
 
+        // Unwrap dictionary-encoded needles when the value type matches
+        // in_array, evaluating against distinct values and mapping back
+        // via keys.
         downcast_dictionary_array! {
             v => {
-                let values_contains = self.contains(v.values().as_ref(), 
negated)?;
-                let result = take(&values_contains, v.keys(), None)?;
-                return Ok(downcast_array(result.as_ref()))
+                if v.values().data_type() == self.in_array.data_type() {
+                    let values_contains = self.contains(v.values().as_ref(), 
negated)?;
+                    let result = take(&values_contains, v.keys(), None)?;
+                    return Ok(downcast_array(result.as_ref()));
+                }

Review Comment:
   Update -- I see it is now because it needs to fall through to use the 
generic comparator.



##########
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:
##########
@@ -918,13 +918,18 @@ CREATE EXTERNAL TABLE dict_filter_bug
 STORED AS PARQUET
 LOCATION 'test_files/scratch/parquet_filter_pushdown/dict_filter_bug.parquet';
 
-query error Can't compare arrays of different types
+query TR

Review Comment:
   👍 



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -99,11 +99,16 @@ impl StaticFilter for ArrayStaticFilter {
             ));
         }
 
+        // Unwrap dictionary-encoded needles when the value type matches
+        // in_array, evaluating against distinct values and mapping back
+        // via keys.
         downcast_dictionary_array! {
             v => {
-                let values_contains = self.contains(v.values().as_ref(), 
negated)?;
-                let result = take(&values_contains, v.keys(), None)?;
-                return Ok(downcast_array(result.as_ref()))
+                if v.values().data_type() == self.in_array.data_type() {

Review Comment:
   ```suggestion
                   // Only do unwrap when the haystack is the dictionary value 
type
                   // and the needle (self.in_array) is the dictionary value 
type
                   if v.values().data_type() == self.in_array.data_type() {
   ```



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -3878,4 +3883,221 @@ mod tests {
         );
         Ok(())
     }
+
+    // -----------------------------------------------------------------------
+    // Tests for try_new_from_array covering all (in_array, needle) type
+    // combinations that occur in HashJoin dynamic filter pushdown.
+    //
+    // try_new (used by SQL IN expressions) always produces a non-Dictionary
+    // in_array because evaluate_list() flattens Dictionary scalars to their
+    // value type. try_new_from_array passes the array directly, so it is
+    // the only path that can produce a Dictionary in_array.

Review Comment:
   I am a little confused about this comment -- it implies that the Dictionary 
is on the in_list side, but the code fix above is handling a Dictionary needle 
(`v`)  and a non Dictionary Haystack 🤔 
   
   I think it may be worthwhile to consider changing the code so that it 
doesn't support a Dictionary in in_array -- but rather normalizes the haystack 
(we can do this as a follow on PR)



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -99,11 +99,16 @@ impl StaticFilter for ArrayStaticFilter {
             ));
         }
 
+        // Unwrap dictionary-encoded needles when the value type matches
+        // in_array, evaluating against distinct values and mapping back
+        // via keys.
         downcast_dictionary_array! {
             v => {
-                let values_contains = self.contains(v.values().as_ref(), 
negated)?;
-                let result = take(&values_contains, v.keys(), None)?;
-                return Ok(downcast_array(result.as_ref()))
+                if v.values().data_type() == self.in_array.data_type() {
+                    let values_contains = self.contains(v.values().as_ref(), 
negated)?;
+                    let result = take(&values_contains, v.keys(), None)?;
+                    return Ok(downcast_array(result.as_ref()));
+                }

Review Comment:
   it seems strange that this will basically ignore arrays where 
`v.values().data_type()` is not the same as `self.in_array.data_type()`
   
   Perhaps we can turn this into an internal error if the types don't match 🤔 
   
   ```rust
     if v.values().data_type() == self.in_array.data_type() {
   ...
     } else {
       return internal_err!("Mismatched types. Expected value type to be {}, 
got {}", ...)?
   }
   ```
   
   



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -3878,4 +3883,221 @@ mod tests {
         );
         Ok(())
     }
+
+    // -----------------------------------------------------------------------
+    // Tests for try_new_from_array covering all (in_array, needle) type
+    // combinations that occur in HashJoin dynamic filter pushdown.
+    //
+    // try_new (used by SQL IN expressions) always produces a non-Dictionary
+    // in_array because evaluate_list() flattens Dictionary scalars to their
+    // value type. try_new_from_array passes the array directly, so it is
+    // the only path that can produce a Dictionary in_array.
+    // -----------------------------------------------------------------------
+
+    fn wrap_in_dict(array: ArrayRef) -> ArrayRef {
+        let keys = Int32Array::from((0..array.len() as 
i32).collect::<Vec<_>>());
+        Arc::new(DictionaryArray::new(keys, array))
+    }
+
+    fn eval_in_list_from_array(
+        needle_type: DataType,

Review Comment:
   if you need the type of `needle` can't w just call `needle.data_type` and 
that way be sure it is consistent with the `needle` array? That would make the 
tests less fragile, likely shorter, and easier to validate they are doing the 
right thing



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -99,11 +99,16 @@ impl StaticFilter for ArrayStaticFilter {
             ));
         }
 
+        // Unwrap dictionary-encoded needles when the value type matches
+        // in_array, evaluating against distinct values and mapping back
+        // via keys.

Review Comment:
   ```suggestion
           // Unwrap dictionary-encoded needles when the value type matches
           // in_array, evaluating against the dictionary values and mapping 
back
           // via keys.
   ```



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -3878,4 +3883,221 @@ mod tests {
         );
         Ok(())
     }
+
+    // -----------------------------------------------------------------------
+    // Tests for try_new_from_array covering all (in_array, needle) type
+    // combinations that occur in HashJoin dynamic filter pushdown.
+    //
+    // try_new (used by SQL IN expressions) always produces a non-Dictionary
+    // in_array because evaluate_list() flattens Dictionary scalars to their
+    // value type. try_new_from_array passes the array directly, so it is
+    // the only path that can produce a Dictionary in_array.
+    // -----------------------------------------------------------------------
+
+    fn wrap_in_dict(array: ArrayRef) -> ArrayRef {
+        let keys = Int32Array::from((0..array.len() as 
i32).collect::<Vec<_>>());
+        Arc::new(DictionaryArray::new(keys, array))
+    }
+
+    fn eval_in_list_from_array(

Review Comment:
   Can we please add comments about what this is doing? I think it is doing  
`needle IN in_array`
   
   It is also probably worth commenting that it is explicitly going down a 
different path than SQL



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to