pjmore opened a new issue, #2145:
URL: https://github.com/apache/arrow-datafusion/issues/2145

   **Describe the bug**
   Dictionary types aren't supported by the hasher. Attempting to join on two 
partitioned columns panics with:
   ```
   thread 'parquet_join_on_partition_columns' panicked at 'called 
`Result::unwrap()` on an `Err` value: Internal("Unsupported data type in 
hasher")', datafusion/src/physical_plan/hash_join.rs:628:6
   ```
   This isn't an issue when joining a partition column with another string column, 
I assume because the type coercion rules cast the partition column to Utf8.
   
   **To Reproduce**
   I reproduced this in the path_partition tests using a slightly modified version 
of register_partitioned_alltypes_parquet that allows setting the name of the 
registered table.
   ```
    async fn parquet_join_on_partition_columns() -> Result<()> {
        let mut ctx = ExecutionContext::new();

        register_partitioned_alltypes_parquet_with_name(
            &mut ctx,
            &[
                "year=2021/month=09/day=09/file.parquet",
                "year=2021/month=10/day=09/file.parquet",
                "year=2021/month=10/day=28/file.parquet",
            ],
            &["year", "month", "day"],
            "",
            "alltypes_plain.parquet",
            "left",
        )
        .await;

        register_partitioned_alltypes_parquet_with_name(
            &mut ctx,
            &[
                "year=2021/month=09/day=09/file.parquet",
                "year=2021/month=10/day=09/file.parquet",
                "year=2021/month=10/day=28/file.parquet",
            ],
            &["year", "month", "day"],
            "",
            "alltypes_plain.parquet",
            "right",
        )
        .await;

        let _results = ctx.sql("SELECT left.id as lid, right.id as rid from left \
            inner join right on left.month = right.month")
            .await?
            .collect()
            .await?;

        Ok(())
    }
   ```
   
   **Expected behavior**
   The query should return results without an error.
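
   In the meantime, explicitly casting both sides of the join key should hit the same 
Utf8 coercion path that already works when joining a partition column against a 
string column, and so avoid the panic. An untested sketch, continuing from the test 
above:
   ```
        // Untested workaround: force both join keys to Utf8 so the hasher never
        // sees a dictionary array.
        let _results = ctx.sql("SELECT left.id as lid, right.id as rid from left \
            inner join right on CAST(left.month AS VARCHAR) = CAST(right.month AS VARCHAR)")
            .await?
            .collect()
            .await?;
   ```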
   
   
   I'm not sure how this should be solved. It would be pretty straightforward to 
have either the planner or the join implementation cast the dictionary column to a 
string array, but that would increase memory usage. Another option would be to add 
dictionary support directly to the hasher, but that adds a large number of extra 
branches if other partition column types are eventually supported. I know a 
row-based format for joins is being implemented, but the same need to cast from 
dictionary -> concrete type would increase memory usage in the row-based version as 
well. There it could be solved with a hash map external to the row that globally 
manages an integer -> T map and encodes the value of the partition column as an 
integer in the row.
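
   For the hasher option, one way to keep memory flat is to hash each distinct 
dictionary value once and then map every row's key to the pre-computed hash, instead 
of casting the whole column to a string array. A rough standalone sketch of the idea 
(it uses std's `DefaultHasher` purely for illustration, not whatever hash the join 
actually uses, and it glosses over nulls):
   ```
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    use arrow::array::{Array, DictionaryArray, StringArray};
    use arrow::datatypes::Int32Type;

    /// Hash a Dictionary(Int32, Utf8) column: one hash per distinct value,
    /// then a cheap lookup per row through the keys array.
    fn hash_string_dictionary(array: &DictionaryArray<Int32Type>) -> Vec<u64> {
        let values = array
            .values()
            .as_any()
            .downcast_ref::<StringArray>()
            .expect("dictionary values are Utf8");

        // Hash each distinct dictionary value exactly once.
        let value_hashes: Vec<u64> = (0..values.len())
            .map(|i| {
                let mut hasher = DefaultHasher::new();
                values.value(i).hash(&mut hasher);
                hasher.finish()
            })
            .collect();

        // Map every row's key to the pre-computed hash of its value.
        array
            .keys()
            .iter()
            .map(|key| match key {
                Some(k) => value_hashes[k as usize],
                None => 0, // null handling elided
            })
            .collect()
    }

    fn main() {
        let array: DictionaryArray<Int32Type> =
            vec!["2021-09", "2021-10", "2021-10"].into_iter().collect();
        let hashes = hash_string_dictionary(&array);
        // Rows 1 and 2 share the dictionary value "2021-10", so they share a hash.
        assert_eq!(hashes[1], hashes[2]);
    }
   ```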
   

