andrei-ionescu commented on issue #1404:
URL: 
https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-986916144


   @Dandandan A few questions:
   
   1. Why do we have yet another ticket? It will sidetrack from the real issue.
   2. The hash partitioning is not working correctly. The issue is not that 
some collected partitions have `0` rows. The issue is that the collected 
partitions do not contain the correct data.
   
   The expected behaviour of repartition is the following...
   
   Given the dataset on the left, after repartitioning it by the 
`nlm_dimension_load_date` and `src_fuel_type` columns, I expect 
`collect_partitioned` to give me the result on the right:
   ```
   +-------------------------+---------------+-----+       P1 | 2020-10-18T00:29:41Z | Steam    | 11 |
   | nlm_dimension_load_date | src_fuel_type |  id |          | 2020-10-18T00:29:41Z | Steam    | 22 |
   +-------------------------+---------------+-----+          | 2020-10-18T00:29:41Z | Steam    | 33 |
   | 2020-10-18T00:29:41Z    | Steam         |  11 |
   | 2020-10-18T00:29:41Z    | Steam         |  22 |
   | 2020-10-18T00:29:41Z    | Steam         |  33 |       P2 | 2021-06-09T00:32:40Z | Gas      |  3 |
   | 2020-10-18T00:29:41Z    | Gas           |   1 |          | 2021-06-09T00:32:40Z | Gas      |  2 |
   | 2021-06-09T00:32:40Z    | Gas           |   3 |
   | 2021-06-09T00:32:40Z    | Gas           |   2 |
   | 2021-06-09T00:32:40Z    | Electric      |  a1 |       P3 | 2021-06-09T00:32:40Z | Electric | a1 |
   | 2020-10-18T00:29:41Z    | Electric      |  b1 |
   | 2020-10-18T00:29:41Z    | Electric      |  c1 |
   | 2020-10-18T00:29:41Z    | Electric      |  d1 |       P4 | 2020-10-18T00:29:41Z | Gas      |  1 |
   +-------------------------+---------------+-----+

                                                           P5 | 2020-10-18T00:29:41Z | Electric | b1 |
                                                              | 2020-10-18T00:29:41Z | Electric | c1 |
                                                              | 2020-10-18T00:29:41Z | Electric | d1 |
   ```
   
   Where `P1` to `P5` each represent a `Vec<RecordBatch>`.
   
   Hashing by the two columns should be deterministic: 
`hash(2020-10-18T00:29:41Z, Steam)` always gives the same result. The first 
three rows, which share the key `2020-10-18T00:29:41Z, Steam`, should therefore 
end up in the same partition.
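
To make the expectation concrete, here is a minimal sketch of the property I mean. It is not DataFusion's implementation: it uses the standard library's `DefaultHasher` as a stand-in hash function, and `Row`, `partition_index`, `repartition` and `example_rows` are hypothetical names made up for this illustration. It only shows that rows with equal key columns are always routed to the same partition.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One row of the example dataset: (nlm_dimension_load_date, src_fuel_type, id).
/// Hypothetical stand-in for a row of a RecordBatch.
type Row = (&'static str, &'static str, &'static str);

/// Map a (date, fuel type) key to a partition index.
/// Hashers created with `DefaultHasher::new()` all behave identically,
/// so equal keys always produce the same index.
fn partition_index(date: &str, fuel: &str, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    (date, fuel).hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

/// Distribute rows into `num_partitions` buckets by hashing the two key columns.
fn repartition(rows: &[Row], num_partitions: usize) -> Vec<Vec<Row>> {
    let mut partitions = vec![Vec::new(); num_partitions];
    for &row in rows {
        partitions[partition_index(row.0, row.1, num_partitions)].push(row);
    }
    partitions
}

/// The ten rows from the table above, in the same order.
fn example_rows() -> Vec<Row> {
    vec![
        ("2020-10-18T00:29:41Z", "Steam", "11"),
        ("2020-10-18T00:29:41Z", "Steam", "22"),
        ("2020-10-18T00:29:41Z", "Steam", "33"),
        ("2020-10-18T00:29:41Z", "Gas", "1"),
        ("2021-06-09T00:32:40Z", "Gas", "3"),
        ("2021-06-09T00:32:40Z", "Gas", "2"),
        ("2021-06-09T00:32:40Z", "Electric", "a1"),
        ("2020-10-18T00:29:41Z", "Electric", "b1"),
        ("2020-10-18T00:29:41Z", "Electric", "c1"),
        ("2020-10-18T00:29:41Z", "Electric", "d1"),
    ]
}
```

Calling `repartition(&example_rows(), 72)` returns 72 buckets, most of them empty, and the three `Steam` rows sit together in the bucket at `partition_index("2020-10-18T00:29:41Z", "Steam", 72)`.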
   
   This is NOT happening in the current implementation.
   
   In the 72-partition example, there are partitions containing different 
values of `nlm_dimension_load_date`.
   

