[
https://issues.apache.org/jira/browse/DRILL-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Boaz Ben-Zvi updated DRILL-6880:
--------------------------------
Description:
When building the Hash Table for the Hash-Join, each new key is matched with an
existing key (same bucket) by calling the generated method
`isKeyMatchInternalBuild`, which compares the two. However when both keys are
null, the method returns *false* (meaning not-equal; i.e. it is a new key),
thus the new key is added into the list following the old key. When a third
null key is found, it would be matched with the prior two, and added as well.
Etc etc ...
This way many null values would perform checks at order N^2 / 2.
_Suggested improvement_: The generated code should return a third result,
meaning "two null keys". Then in case of Inner or Left joins all the duplicate
nulls can be discarded.
Below is a simple example, note the time difference between non-null and the
all-nulls tables (also instrumentation showed that for nulls, the method above
was called 1249975000 times!!)
{code:java}
0: jdbc:drill:zk=local> use dfs.tmp;
0: jdbc:drill:zk=local> create table testNull as (select cast(null as int)
mycol from
dfs.`/data/test128M.tbl` limit 50000);
0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1
from
dfs.`/data/test128M.tbl` limit 60000);
0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2
from dfs.`/data/test128M.tbl` limit 50000);
0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1 =
test2.mycol2;
+---------+
| EXPR$0 |
+---------+
| 0 |
+---------+
1 row selected (0.443 seconds)
0: jdbc:drill:zk=local> select count(*) from test1 join testNull on
test1.mycol1 = testNull.mycol;
+---------+
| EXPR$0 |
+---------+
| 0 |
+---------+
1 row selected (140.098 seconds)
{code}
was:
When building the Hash Table for the Hash-Join, each new key is matched with an
existing key (same bucket) by calling the generated method
`isKeyMatchInternalBuild`, which compares the two. However when both keys are
null, the method returns *false* (meaning not-equal; i.e. it is a new key),
thus the new key is added into the list following the old key. When a third
null key is found, it would be matched with the prior two, and added as well.
Etc etc ...
This way many null values would perform checks at order N^2 / 2.
Suggested improvement: The generated code should return a third result, meaning
"two null keys". Then in case of Inner or Left joins all the duplicate nulls
can be discarded.
Below is a simple example, note the time difference between non-null and the
all-nulls tables (also instrumentation showed that for nulls, the method above
was called 1249975000 times!!)
{code:java}
0: jdbc:drill:zk=local> use dfs.tmp;
0: jdbc:drill:zk=local> create table test as (select cast(null as int) mycol
from
dfs.`/data/test128M.tbl` limit 50000);
0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1
from
dfs.`/data/test128M.tbl` limit 60000);
0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2
from dfs.`/data/test128M.tbl` limit 50000);
0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1 =
test2.mycol2;
+---------+
| EXPR$0 |
+---------+
| 0 |
+---------+
1 row selected (0.443 seconds)
0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1
from dfs.`/data/test128M.tbl` limit 60000);
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 0_0 | 60000 |
+-----------+----------------------------+
1 row selected (0.517 seconds)
0: jdbc:drill:zk=local> select count(*) from test1 join test on test1.mycol1 =
test.mycol;
+---------+
| EXPR$0 |
+---------+
| 0 |
+---------+
1 row selected (140.098 seconds)
{code}
> Hash-Join: Many null keys on the build side form a long linked chain in the
> Hash Table
> --------------------------------------------------------------------------------------
>
> Key: DRILL-6880
> URL: https://issues.apache.org/jira/browse/DRILL-6880
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Relational Operators
> Affects Versions: 1.14.0
> Reporter: Boaz Ben-Zvi
> Assignee: Boaz Ben-Zvi
> Priority: Critical
> Fix For: 1.16.0
>
>
> When building the Hash Table for the Hash-Join, each new key is matched with
> an existing key (same bucket) by calling the generated method
> `isKeyMatchInternalBuild`, which compares the two. However when both keys are
> null, the method returns *false* (meaning not-equal; i.e. it is a new key),
> thus the new key is added into the list following the old key. When a third
> null key is found, it would be matched with the prior two, and added as well.
> Etc etc ...
> This way many null values would perform checks at order N^2 / 2.
> _Suggested improvement_: The generated code should return a third result,
> meaning "two null keys". Then in case of Inner or Left joins all the
> duplicate nulls can be discarded.
> Below is a simple example, note the time difference between non-null and the
> all-nulls tables (also instrumentation showed that for nulls, the method
> above was called 1249975000 times!!)
> {code:java}
> 0: jdbc:drill:zk=local> use dfs.tmp;
> 0: jdbc:drill:zk=local> create table testNull as (select cast(null as int)
> mycol from
> dfs.`/data/test128M.tbl` limit 50000);
> 0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1
> from
> dfs.`/data/test128M.tbl` limit 60000);
> 0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2
> from dfs.`/data/test128M.tbl` limit 50000);
> 0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1
> = test2.mycol2;
> +---------+
> | EXPR$0 |
> +---------+
> | 0 |
> +---------+
> 1 row selected (0.443 seconds)
> 0: jdbc:drill:zk=local> select count(*) from test1 join testNull on
> test1.mycol1 = testNull.mycol;
> +---------+
> | EXPR$0 |
> +---------+
> | 0 |
> +---------+
> 1 row selected (140.098 seconds)
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)