Github user adoron commented on the issue:

    https://github.com/apache/spark/pull/23043
  
    @cloud-fan that's what I thought at first as well, but the flow doesn't go through that code. I verified by running `Seq(0.0d, 0.0d, -0.0d).toDF("i").groupBy("i").count().collect()` with a breakpoint set.
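    For reference, a self-contained version of that repro (sketch only; the `local[*]` session setup is just for illustration). On a build without this fix I'd expect the two zeros to land in separate groups:
    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("neg-zero-groupby").getOrCreate()
    import spark.implicits._

    // Expected on an unfixed build: 0.0 and -0.0 end up in separate groups,
    // e.g. one row (0.0, 2) and one row (-0.0, 1) instead of a single (0.0, 3).
    Seq(0.0d, 0.0d, -0.0d).toDF("i").groupBy("i").count().show()
    ```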
    
    The reason -0.0 and 0.0 are put in different "group by" buckets is in UnsafeFixedWidthAggregationMap::getAggregationBufferFromUnsafeRow():
    ```java
    public UnsafeRow getAggregationBufferFromUnsafeRow(UnsafeRow key) {
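        // key.hashCode() hashes the raw UnsafeRow bytes, so 0.0 and -0.0
        // (equal values, different bit patterns) get different hash codes.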
        return getAggregationBufferFromUnsafeRow(key, key.hashCode());
    }
    ```
    The hashing is done on the UnsafeRow, and by that point the whole row is hashed as a unit, so it's hard to single out the double columns and their values.
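
    To illustrate the underlying issue at the bit level (plain Scala sketch, not Spark code): 0.0 and -0.0 compare equal but have different IEEE 754 bit patterns, so any hash computed over the raw bytes treats them as different keys unless the value is normalized before it's written:
    ```scala
    object NegativeZeroBits {
      def main(args: Array[String]): Unit = {
        val pos = 0.0d
        val neg = -0.0d
        println(pos == neg)  // true: the values are semantically equal
        println(java.lang.Long.toHexString(java.lang.Double.doubleToRawLongBits(pos)))  // 0
        println(java.lang.Long.toHexString(java.lang.Double.doubleToRawLongBits(neg)))  // 8000000000000000
        // A hypothetical normalization step before writing the row would collapse them:
        val normalized = if (neg == 0.0d) 0.0d else neg
        println(java.lang.Long.toHexString(java.lang.Double.doubleToRawLongBits(normalized)))  // 0
      }
    }
    ```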


