[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

cloud-fan Tue, 29 Dec 2015 22:58:12 -0800

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48589312
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +179,229 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            0 for true, 1 for false.
    + *  - byte, short, int:   the input itself.
    + *  - long:               input XOR (input >>> 32)
    + *  - float:              java.lang.Float.floatToIntBits(input)
    + *  - double:             l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary:             java.util.Arrays.hashCode(input)
    + *  - array:              recursively calculate hash value for each element, and aggregate them by
    + *                        `result = result * 37 + elementHash` with an initial value `result = 37`.
    + *  - map:                recursively calculate hash value for each key-value pair, and aggregate
    + *                        them by `result += keyHash XOR valueHash`.
    + *  - struct:             similar to array, calculate hash value for each field and aggregate them.
    + *  - other type:         input.hashCode().
    + *                        e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by the same way of array.
    + *
    + * This hash algorithm is basically same with `GenericInternalRow.hashCode`, but using this hash
    + * expression is better as it can produce consistent hash values between safe and unsafe data
    + * structure, and can be slightly faster by codegen.
    + * It's also the hash function for both shuffle and bucketing, so that we can guarantee shuffle and
    + * bucketing have same data distribution.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    --- End diff --
    
    Good point!
    Since we have decided not to follow Hive, I agree Murmur3Hash is a better choice.
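
For readers following the thread, the per-type rules in the doc comment quoted above can be sketched in plain Scala roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request: the names `SimpleHashSketch`, `hashValue`, and `combine` are hypothetical, plain Scala collections stand in for Catalyst's internal array, map, and struct types, and strings fall through to `String.hashCode` rather than `UTF8String.hashCode()`.

```scala
import java.util.Arrays

// Hypothetical sketch of the hashing rules listed in the quoted doc comment.
// It does not use any Spark classes.
object SimpleHashSketch {

  // Hash of a single value, chosen by its runtime type.
  def hashValue(v: Any): Int = v match {
    case null       => 0
    case b: Boolean => if (b) 0 else 1 // 0 for true, 1 for false, as documented
    case b: Byte    => b.toInt
    case s: Short   => s.toInt
    case i: Int     => i
    case l: Long    => (l ^ (l >>> 32)).toInt
    case f: Float   => java.lang.Float.floatToIntBits(f)
    case d: Double =>
      val bits = java.lang.Double.doubleToLongBits(d)
      (bits ^ (bits >>> 32)).toInt
    case bytes: Array[Byte] => Arrays.hashCode(bytes)
    // Assumption: a Scala Map stands in for a Catalyst map value.
    case m: Map[_, _] =>
      m.iterator.map { case (k, value) => hashValue(k) ^ hashValue(value) }.sum
    // Assumption: a Scala Seq stands in for both array and struct values.
    case xs: Seq[_] => combine(xs)
    // Strings land here and use String.hashCode, unlike UTF8String in Spark.
    case other => other.hashCode()
  }

  // Aggregate element hashes: result = result * 37 + elementHash, starting at 37.
  def combine(values: Seq[Any]): Int =
    values.foldLeft(37)((result, v) => result * 37 + hashValue(v))
}
```

With this sketch, `SimpleHashSketch.combine(Seq(1, 2))` evaluates to `(37 * 37 + 1) * 37 + 2 = 50692`. Small integers map to themselves before aggregation, so nearby inputs produce nearby hashes; that kind of weak mixing is one plausible reason a Murmur3-based hash was preferred in this thread.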



