[jira] [Commented] (DRILL-6880) Hash-Join: Many null keys on the build side form a long linked chain in the Hash Table

Boaz Ben-Zvi (JIRA) Tue, 04 Dec 2018 20:51:02 -0800


    [ 
https://issues.apache.org/jira/browse/DRILL-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709594#comment-16709594
 ]


Boaz Ben-Zvi commented on DRILL-6880:
-------------------------------------

Another "side issue" is the execution time of the generated method 
_isKeyMatchInternalBuild_ (which compares the keys, checks null conditions, 
NaN, etc). Normally that invocation time is less than a Micro-second, but on 
some occasions we've seen it go up to several Mili-Seconds (up to 117,000 uSec 
!!). 
 A big contributor to that is the Hotspot compiler, which executes in a 
"server" mode (i.e., spends more time optimizing, for a better result). When 
switched to a "client" mode (by adding the "-client" option to 
*DRILL_JAVA_OPTS*), those abnormal times diminished significantly.

> Hash-Join: Many null keys on the build side form a long linked chain in the 
> Hash Table
> --------------------------------------------------------------------------------------
>
>                 Key: DRILL-6880
>                 URL: https://issues.apache.org/jira/browse/DRILL-6880
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.14.0
>            Reporter: Boaz Ben-Zvi
>            Priority: Major
>             Fix For: 1.16.0
>
>
> When building the Hash Table for the Hash-Join, each new key is matched with 
> an existing key (same bucket) by calling the generated method 
> `isKeyMatchInternalBuild`, which compares the two. However when both keys are 
> null, the method returns *false* (meaning not-equal; i.e. it is a new key), 
> thus the new key is added into the list following the old key. When a third 
> null key is found, it would be matched with the prior two, and added as well. 
> Etc etc ...
> This way many null values would perform checks at order N^2 / 2.
> Suggested improvement: The generated code should return a third result, 
> meaning "two null keys". Then in case of Inner or Left joins all the 
> duplicate nulls can be discarded.
> Below is a simple example, note the time difference between non-null and the 
> all-nulls tables (also instrumentation showed that for nulls, the method 
> above was called 1249975000 times!!)
> {code:java}
> 0: jdbc:drill:zk=local> use dfs.tmp;
> 0: jdbc:drill:zk=local> create table test as (select cast(null as int) mycol 
> from 
>  dfs.`/data/test128M.tbl` limit 50000);
> 0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
> from 
>  dfs.`/data/test128M.tbl` limit 60000);
> 0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2 
> from dfs.`/data/test128M.tbl` limit 50000);
> 0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1 
> = test2.mycol2;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> 1 row selected (0.443 seconds)
> 0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 
> from dfs.`/data/test128M.tbl` limit 60000);
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 60000                      |
> +-----------+----------------------------+
> 1 row selected (0.517 seconds)
> 0: jdbc:drill:zk=local> select count(*) from test1 join test on test1.mycol1 
> = test.mycol;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> 1 row selected (140.098 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (DRILL-6880) Hash-Join: Many null keys on the build side form a long linked chain in the Hash Table

Reply via email to