[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883394#comment-15883394 ]
Reynold Xin commented on SPARK-17495:
-------------------------------------
Let me put some thoughts here. Please let me know if I missed anything:
1. On the read side we shouldn't care which hash function was used. All we need
to know is that the data is hash partitioned by some hash function, and that
should be sufficient to remove the shuffle needed in aggregation or join (see
the first sketch after this list).
2. On the write side it does matter. If we are writing to a Hive bucketed
table, the Hive hash function should be used; otherwise a Spark hash function
should be used. This can perhaps be an option in the writer interface,
populated automatically for catalog tables based on what kind of table it is
(second sketch below).
3. In general it would be useful to let users configure which implementation
the "hash" function maps to. This can be a dynamic config (third sketch
below).
> Hive hash implementation
> ------------------------
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Tejas Patil
> Assignee: Tejas Patil
> Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning, which differs from the
> hash function used by Hive. For queries that use bucketing, this leads to
> different results when the same query is run on both engines. We want users
> to have backward compatibility so that they can switch parts of an
> application between the engines without observing regressions.
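
The divergence is easy to see on integers: Spark's hash() is Murmur3 (seed
42), while Hive hashes an int to the value itself. A minimal sketch follows;
the "hive_bucket" column hand-computes Hive's int hash rule rather than
calling any Hive API:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The same key lands in different buckets on the two engines.
Seq(1, 2, 3).toDF("k").select(
  col("k"),
  pmod(hash(col("k")), lit(8)).as("spark_bucket"), // Murmur3-based bucket
  pmod(col("k"), lit(8)).as("hive_bucket")         // Hive's int hash is the value
).show()
{code}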