[GitHub] [arrow-datafusion] andrei-ionescu edited a comment on issue #1404: Hash partitioning not working properly

GitBox Mon, 06 Dec 2021 08:11:46 -0800


andrei-ionescu edited a comment on issue #1404:
URL: 
https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-986916144



   @Dandandan A few questions:
   
   1. Why do we have yet another ticket? It will sidetrack from the real issue.
   2. The hash partitioning is not working correctly. The issue is not that some collected partitions have `0` rows. The issue is that the collected partitions do not contain the correct data. If you compare the row counts from the manual partitioning with those from the repartition by hash, some collected partitions have more rows than any manual partition. For example, `Partition=35, rows_in_part=943` has `943` rows, while the maximum in the table above is `571`. This means that `Partition=35, rows_in_part=943` contains rows from other partitions.
   
   The expected behaviour of repartition is the following...
   
   Given the dataset on the left, after repartitioning it by the `nlm_dimension_load_date` and `src_fuel_type` columns I expect `collect_partitioned` to give me the result on the right:
   ```
   +-------------------------+---------------+-----+       P1 | 2020-10-18T00:29:41Z | Steam    | 11 |
   | nlm_dimension_load_date | src_fuel_type |  id |          | 2020-10-18T00:29:41Z | Steam    | 22 |
   +-------------------------+---------------+-----+          | 2020-10-18T00:29:41Z | Steam    | 33 |
   | 2020-10-18T00:29:41Z    | Steam         |  11 |
   | 2020-10-18T00:29:41Z    | Steam         |  22 |
   | 2020-10-18T00:29:41Z    | Steam         |  33 |       P2 | 2021-06-09T00:32:40Z | Gas      |  3 |
   | 2020-10-18T00:29:41Z    | Gas           |   1 |          | 2021-06-09T00:32:40Z | Gas      |  2 |
   | 2021-06-09T00:32:40Z    | Gas           |   3 |
   | 2021-06-09T00:32:40Z    | Gas           |   2 |
   | 2021-06-09T00:32:40Z    | Electric      |  a1 |       P3 | 2021-06-09T00:32:40Z | Electric | a1 |
   | 2020-10-18T00:29:41Z    | Electric      |  b1 |
   | 2020-10-18T00:29:41Z    | Electric      |  c1 |
   | 2020-10-18T00:29:41Z    | Electric      |  d1 |       P4 | 2020-10-18T00:29:41Z | Gas      |  1 |
   +-------------------------+---------------+-----+

                                                           P5 | 2020-10-18T00:29:41Z | Electric | b1 |
                                                              | 2020-10-18T00:29:41Z | Electric | c1 |
                                                              | 2020-10-18T00:29:41Z | Electric | d1 |
   ```
   
   Where each of `P1` to `P5` represents a `Vec<RecordBatch>`.
   
   Hashing by the two columns should be deterministic: `hash(2020-10-18T00:29:41Z, Steam)` always gives the same result. The first three rows, which all hash on `2020-10-18T00:29:41Z, Steam`, should therefore end up in the same partition.
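   
   To make the expectation concrete, here is a minimal sketch of that invariant in plain Rust. It does not use DataFusion's hashing or `RecordBatch`: `Row`, `partition_index`, `partition_rows`, `keys_are_colocated` and `sample_rows` are hypothetical helpers written for this illustration, and `DefaultHasher` from the standard library stands in for whatever hash function the repartitioning uses.
   
   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::collections::HashMap;
   use std::hash::{Hash, Hasher};
   
   /// One row of the sample dataset: (nlm_dimension_load_date, src_fuel_type, id).
   /// Hypothetical stand-in for a row of a RecordBatch.
   type Row = (&'static str, &'static str, &'static str);
   
   /// Hash the two partitioning columns and map the hash to a partition index.
   /// `DefaultHasher::new()` always starts from the same state, so equal keys
   /// always produce the same index.
   fn partition_index(date: &str, fuel: &str, num_partitions: usize) -> usize {
       let mut hasher = DefaultHasher::new();
       (date, fuel).hash(&mut hasher);
       (hasher.finish() % num_partitions as u64) as usize
   }
   
   /// Distribute rows into `num_partitions` buckets by the hash of (date, fuel).
   fn partition_rows(rows: &[Row], num_partitions: usize) -> Vec<Vec<Row>> {
       let mut parts: Vec<Vec<Row>> = vec![Vec::new(); num_partitions];
       for &row in rows {
           parts[partition_index(row.0, row.1, num_partitions)].push(row);
       }
       parts
   }
   
   /// Returns true if every distinct (date, fuel) key lives in exactly one partition.
   fn keys_are_colocated(parts: &[Vec<Row>]) -> bool {
       let mut seen: HashMap<(&str, &str), usize> = HashMap::new();
       for (idx, part) in parts.iter().enumerate() {
           for row in part {
               // Remember the first partition a key was seen in; any other is a violation.
               if *seen.entry((row.0, row.1)).or_insert(idx) != idx {
                   return false;
               }
           }
       }
       true
   }
   
   /// The ten rows from the table above.
   fn sample_rows() -> Vec<Row> {
       vec![
           ("2020-10-18T00:29:41Z", "Steam", "11"),
           ("2020-10-18T00:29:41Z", "Steam", "22"),
           ("2020-10-18T00:29:41Z", "Steam", "33"),
           ("2020-10-18T00:29:41Z", "Gas", "1"),
           ("2021-06-09T00:32:40Z", "Gas", "3"),
           ("2021-06-09T00:32:40Z", "Gas", "2"),
           ("2021-06-09T00:32:40Z", "Electric", "a1"),
           ("2020-10-18T00:29:41Z", "Electric", "b1"),
           ("2020-10-18T00:29:41Z", "Electric", "c1"),
           ("2020-10-18T00:29:41Z", "Electric", "d1"),
       ]
   }
   ```
   
   Calling `partition_rows(&sample_rows(), 72)` and passing the result to `keys_are_colocated` checks that rows sharing a `(nlm_dimension_load_date, src_fuel_type)` pair are never split across partitions, which is the property I expect from the repartition.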
   
   This is NOT happening in the current implementation.
   
   In the 72 partition example there are partitions containing different values 
from `nlm_dimension_load_date`.
   


