[
https://issues.apache.org/jira/browse/DRILL-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mehant Baid updated DRILL-2803:
-------------------------------
Assignee: Jacques Nadeau (was: Mehant Baid)
> Severe skew due to null values in columns even when other columns are non-null
> ------------------------------------------------------------------------------
>
> Key: DRILL-2803
> URL: https://issues.apache.org/jira/browse/DRILL-2803
> Project: Apache Drill
> Issue Type: Bug
> Components: Functions - Drill
> Affects Versions: 0.8.0
> Reporter: Aman Sinha
> Assignee: Jacques Nadeau
> Fix For: 0.9.0
>
> Attachments: DRILL-2803.patch
>
>
> If you have 2 columns that are hashed (either for distribution or for hash
> based operators) and one of those columns has lots of null values, it can
> result in substantial skew even if the other column has non-null values.
> In the following query the combined hash value of 2 columns is 0 even when 1
> column is non-null. The reason is that if the starting value is null (for
> cr_reason_sk all values are null in the above query), it does not matter what
> seed is passed in. The hash function treats the second parameter as a seed
> and not as a combiner, so it gets ignored.
> {code}
> select cr_call_center_sk, cr_reason_sk, hash64(cr_reason_sk,
> hash64(cr_call_center_sk)) as hash_value from catalog_returns where
> cr_reason_sk is null and cr_call_center_sk is not null limit 10;
> +-------------------+--------------+------------+
> | cr_call_center_sk | cr_reason_sk | hash_value |
> +-------------------+--------------+------------+
> | 1 | null | 0 |
> | 1 | null | 0 |
> | 4 | null | 0 |
> | 1 | null | 0 |
> | 4 | null | 0 |
> | 2 | null | 0 |
> | 2 | null | 0 |
> | 2 | null | 0 |
> | 2 | null | 0 |
> | 2 | null | 0 |
> +-------------------+--------------+------------+
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)