[
https://issues.apache.org/jira/browse/DRILL-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chun Chang closed DRILL-2803.
-----------------------------
Assignee: Chun Chang (was: Jacques Nadeau)
verified fix
{code}
0: jdbc:drill:schema=dfs.drillTestDirAdvanced> select t.id, t.nul,
hash64(t.nul, hash64(t.id)) as hash_value from `complex.json` t where t.nul is
null order by t.id limit 10;
+------------+------------+------------+
| id | nul | hash_value |
+------------+------------+------------+
| 2 | null | 2087691676675520095 |
| 7 | null | 1124679897442190205 |
| 8 | null | 6873016763837711824 |
| 9 | null | 5689101538750744972 |
| 14 | null | 3054377820775748426 |
| 15 | null | 7945380393500328936 |
| 18 | null | 3527854738750367914 |
| 20 | null | 3939724015981009335 |
| 23 | null | 6367888492325306020 |
| 25 | null | 1786113389955931412 |
+------------+------------+------------+
10 rows selected (13.679 seconds)
0: jdbc:drill:schema=dfs.drillTestDirAdvanced> select * from sys.version;
+------------+----------------+-------------+-------------+------------+
| commit_id | commit_message | commit_time | build_email | build_time |
+------------+----------------+-------------+-------------+------------+
| 57a96d200e12c0efcad3f3ca9d935c42647234b1 | DRILL-2083: Fix bug in merging
receiver | 27.04.2015 @ 17:12:13 EDT | Unknown | 27.04.2015 @ 23:19:52 EDT |
+------------+----------------+-------------+-------------+------------+
{code}
add a new test case - complex320.q
> Severe skew due to null values in columns even when other columns are non-null
> ------------------------------------------------------------------------------
>
> Key: DRILL-2803
> URL: https://issues.apache.org/jira/browse/DRILL-2803
> Project: Apache Drill
> Issue Type: Bug
> Components: Functions - Drill
> Affects Versions: 0.8.0
> Reporter: Aman Sinha
> Assignee: Chun Chang
> Fix For: 0.9.0
>
> Attachments: DRILL-2803.patch
>
>
> If you have 2 columns that are hashed (either for distribution or for hash
> based operators) and one of those columns has lots of null values, it can
> result in substantial skew even if the other column has non-null values.
> In the following query the combined hash value of 2 columns is 0 even when 1
> column is non-null. The reason is that if the starting value is null (for
> cr_reason_sk all values are null in the above query), it does not matter what
> seed is passed in. The hash function treats the second parameter as a seed
> and not as a combiner, so it gets ignored.
> {code}
> select cr_call_center_sk, cr_reason_sk, hash64(cr_reason_sk,
> hash64(cr_call_center_sk)) as hash_value from catalog_returns where
> cr_reason_sk is null and cr_call_center_sk is not null limit 10;
> +-------------------+--------------+------------+
> | cr_call_center_sk | cr_reason_sk | hash_value |
> +-------------------+--------------+------------+
> | 1 | null | 0 |
> | 1 | null | 0 |
> | 4 | null | 0 |
> | 1 | null | 0 |
> | 4 | null | 0 |
> | 2 | null | 0 |
> | 2 | null | 0 |
> | 2 | null | 0 |
> | 2 | null | 0 |
> | 2 | null | 0 |
> +-------------------+--------------+------------+
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)