gene-bordegaray opened a new issue, #20176:
URL: https://github.com/apache/datafusion/issues/20176

   ### Describe the bug
   
   When preserving file partitioning we claim "hash" partitioned on the columns 
values. Then when applying a dynamic filter for partitioned joins it creates a 
filter based on a true hash function, not key/value based like file 
partitioning. This causes partitions to be filtered incorrectly:
   
   ```text
   Declare file partitioned: Hash([partition_col], N)
   Filter with hash-based routing:
     CASE hash(key) % N
       WHEN 0 THEN <partition 0 bounds>
       WHEN 1 THEN <partition 1 bounds>
       ...
   
   Since file partitioned is using value based partitioning and dynamic 
filtering is using true hash based end of in this scenario:
     ┌─────┬──────────────────────┬───────────────┐
     │ Key │ Hive Partition Index │ hash(key) % 3 │
     ├─────┼──────────────────────┼───────────────┤
     │ "A" │ 0                    │ 2             │
     ├─────┼──────────────────────┼───────────────┤
     │ "B" │ 1                    │ 2             │
     ├─────┼──────────────────────┼───────────────┤
     │ "C" │ 2                    │ 2             │
     └─────┴──────────────────────┴───────────────┘
   
   thus, when the filter evaluates row f_dkey=A, the filter computes it to map 
to partition 2 bounds even though it is actually in partition 0. So it doesn't 
match and is incorrectly filtered out.
   ```
   
   ### To Reproduce
   
   see PR: #20175 
   
   ### Expected behavior
   
   For now dynamic filtering will be disabled for partitioned hash joins when 
preserve file partitions is on.
   
   A long term fix is to introduce a new type of partitioning for the file 
partitioning to safely distinguish the two. Something like `KeyPartitoned` or 
`ValuePartitioned` is suiting.
   
   Oracle calls this 
[`ListPartitioning`](https://docs.oracle.com/en/database/oracle/oracle-database/26/cncpt/partitions-views-and-other-schema-objects.html)
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to