[ https://issues.apache.org/jira/browse/DRILL-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276363#comment-14276363 ]

Aman Sinha commented on DRILL-1987:
-----------------------------------

Correcting my previous comment about 'hash join itself completes relatively 
quickly'... actually the probe side of the hash join is pipelined with the 
downstream operators (in this case a hash partition and a sort), so the hash 
join's completion time depends on how fast its output records can be consumed 
by the downstream operators.

The actual amount of 'work' for HJ and MJ should be equivalent, since both 
need to produce the same number of output rows, so I worked with Chun to 
confirm the suspicion about wrong results with MJ; that will be addressed by 
DRILL-2010.

For this particular bug I think we should try to create a test case independent 
of joins, just doing an ORDER BY and LIMIT over a few tens of billions of 
records.  That would clearly isolate the issue (although it requires 
significant disk storage).
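
For example, a standalone test along these lines (the table name and data 
scale are purely illustrative, not an existing data set):

{code}
-- Hypothetical standalone test: no join, just sort + limit over a very large,
-- heavily duplicated data set, so only the TopN / merge-exchange path is exercised.
select t.id, t.gbyi, t.str
from `complex_huge.json` t   -- assume a few tens of billions of rows
order by t.id
limit 20;
{code}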

> join with tons of duplicates hangs with hash join
> -------------------------------------------------
>
>                 Key: DRILL-1987
>                 URL: https://issues.apache.org/jira/browse/DRILL-1987
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 0.8.0
>            Reporter: Chun Chang
>            Assignee: Aman Sinha
>
> #Fri Jan 09 20:39:31 EST 2015
> git.commit.id.abbrev=487d98e
> With hash join enabled (default), the following join query hangs (running for 
> about 30 min now). The join condition has mostly duplicates. Each table has 1 
> million rows. Data can be downloaded here:
> https://s3.amazonaws.com/apache-drill/files/complex.json.gz
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_mergejoin` = false;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.025 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_hashjoin` = true;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.045 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select a.id, b.gbyi, a.str 
> from `complex.json` a inner join `complex.json` b on a.gbyi=b.gbyi order by 
> a.id limit 20;
> +------------+------------+------------+
> |     id     |    gbyi    |    str     |
> +------------+------------+------------+
> {code}
> physical plan:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select a.id, 
> b.gbyi, a.str from `complex.json` a inner join `complex.json` b on 
> a.gbyi=b.gbyi order by a.id limit 20;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(id=[$0], gbyi=[$1], str=[$2])
> 00-02        SelectionVectorRemover
> 00-03          Limit(fetch=[20])
> 00-04            SingleMergeExchange(sort0=[0 ASC])
> 01-01              SelectionVectorRemover
> 01-02                TopN(limit=[20])
> 01-03                  HashToRandomExchange(dist0=[[$0]])
> 02-01                    Project(id=[$1], gbyi=[$3], str=[$2])
> 02-02                      HashJoin(condition=[=($0, $3)], joinType=[inner])
> 02-04                        HashToRandomExchange(dist0=[[$0]])
> 03-01                          Project(gbyi=[$0], id=[$2], str=[$1])
> 03-02                            Scan(groupscan=[EasyGroupScan 
> [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, 
> columns=[`gbyi`, `id`, `str`], 
> files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> 02-03                        Project(gbyi0=[$0])
> 02-05                          HashToRandomExchange(dist0=[[$0]])
> 04-01                            Scan(groupscan=[EasyGroupScan 
> [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, 
> columns=[`gbyi`], 
> files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> {code}
> If I turn merge join on, the query finishes rather quickly, like within a 
> minute.
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_hashjoin` = false;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.026 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set 
> `planner.enable_mergejoin` = true;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select a.id, 
> b.gbyi, a.str from `complex.json` a inner join `complex.json` b on 
> a.gbyi=b.gbyi order by a.id limit 20;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(id=[$0], gbyi=[$1], str=[$2])
> 00-02        SelectionVectorRemover
> 00-03          Limit(fetch=[20])
> 00-04            SingleMergeExchange(sort0=[0 ASC])
> 01-01              SelectionVectorRemover
> 01-02                TopN(limit=[20])
> 01-03                  HashToRandomExchange(dist0=[[$0]])
> 02-01                    Project(id=[$1], gbyi=[$3], str=[$2])
> 02-02                      MergeJoin(condition=[=($0, $3)], joinType=[inner])
> 02-04                        SelectionVectorRemover
> 02-06                          Sort(sort0=[$0], dir0=[ASC])
> 02-08                            HashToRandomExchange(dist0=[[$0]])
> 03-01                              Project(gbyi=[$0], id=[$2], str=[$1])
> 03-02                                Scan(groupscan=[EasyGroupScan 
> [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, 
> columns=[`gbyi`, `id`, `str`], 
> files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> 02-03                        Project(gbyi0=[$0])
> 02-05                          SelectionVectorRemover
> 02-07                            Sort(sort0=[$0], dir0=[ASC])
> 02-09                              HashToRandomExchange(dist0=[[$0]])
> 04-01                                Scan(groupscan=[EasyGroupScan 
> [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, 
> columns=[`gbyi`], 
> files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> {code}


