Boaz Ben-Zvi created DRILL-6880:
-----------------------------------

             Summary: Hash-Join: Many null keys on the build side form a long 
linked chain in the Hash Table
                 Key: DRILL-6880
                 URL: https://issues.apache.org/jira/browse/DRILL-6880
             Project: Apache Drill
          Issue Type: Improvement
          Components: Execution - Relational Operators
    Affects Versions: 1.14.0
            Reporter: Boaz Ben-Zvi
             Fix For: 1.16.0


When building the Hash Table for the Hash-Join, each new key is matched with an 
existing key (same bucket) by calling the generated method 
`isKeyMatchInternalBuild`, which compares the two. However when both keys are 
null, the method returns *false* (meaning not-equal; i.e. it is a new key), 
thus the new key is added into the list following the old key. When a third 
null key is found, it would be matched with the prior two, and added as well. 
Etc etc ...

This way many null values would perform checks at order N^2 / 2.

Suggested improvement: The generated code should return a third result, meaning 
"two null keys". Then in case of Inner or Left joins all the duplicate nulls 
can be discarded.

Below is a simple example, note the time difference between non-null and the 
all-nulls tables (also instrumentation showed that for nulls, the method above 
was called 1249975000 times!!)
{code:java}
0: jdbc:drill:zk=local> use dfs.tmp;
0: jdbc:drill:zk=local> create table test as (select cast(null as int) mycol 
from 
 dfs.`/data/test128M.tbl` limit 50000);
0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
from 
 dfs.`/data/test128M.tbl` limit 60000);
0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2 
from dfs.`/data/test128M.tbl` limit 50000);
0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1 = 
test2.mycol2;
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (0.443 seconds)
0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
from dfs.`/data/test128M.tbl` limit 60000);
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 60000                      |
+-----------+----------------------------+
1 row selected (0.517 seconds)
0: jdbc:drill:zk=local> select count(*) from test1 join test on test1.mycol1 = 
test.mycol;
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (140.098 seconds)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to