[jira] [Commented] (HIVE-3286) Explicit skew join on user provided condition

Hive QA (JIRA) Mon, 16 Dec 2013 04:03:42 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849088#comment-13849088
 ]


Hive QA commented on HIVE-3286:
-------------------------------



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12618867/HIVE-3286.16.patch.txt

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 4790 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_skewjoin_explicit
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_skewjoin_explicit_invalid1
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_skewjoin_explicit_invalid2
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_skewjoin_explicit_invalid3
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/652/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/652/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12618867

> Explicit skew join on user provided condition
> ---------------------------------------------
>
>                 Key: HIVE-3286
>                 URL: https://issues.apache.org/jira/browse/HIVE-3286
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>         Attachments: D4287.11.patch, HIVE-3286.12.patch.txt, 
> HIVE-3286.13.patch.txt, HIVE-3286.14.patch.txt, HIVE-3286.15.patch.txt, 
> HIVE-3286.16.patch.txt, HIVE-3286.D4287.10.patch, HIVE-3286.D4287.5.patch, 
> HIVE-3286.D4287.6.patch, HIVE-3286.D4287.7.patch, HIVE-3286.D4287.8.patch, 
> HIVE-3286.D4287.9.patch
>
>
> Join operation on table with skewed data takes most of execution time 
> handling the skewed keys. But mostly we already know about that and even know 
> what is look like the skewed keys.
> If we can explicitly assign reducer slots for the skewed keys, total 
> execution time could be greatly shortened.
> As for a start, I've extended join grammar something like this.
> {code}
> select * from src a join src b on a.key=b.key skew on (a.key+1 < 50, a.key+1 
> < 100, a.key < 150);
> {code}
> which means if above query is executed by 20 reducers, one reducer for 
> a.key+1 < 50, one reducer for 50 <= a.key+1 < 100, one reducer for 99 <= 
> a.key < 150, and 17 reducers for others (could be extended to assign more 
> than one reducer later)
> This can be only used with common-inner-equi joins. And skew condition should 
> be composed of join keys only.
> Work till done now will be updated shortly after code cleanup.
> ----------------------------
> Skew expressions* in "SKEW ON (expr, expr, ...)" are evaluated sequentially 
> at runtime, and first 'true' one decides skew group for the row. Each skew 
> group has reserved partition slot(s), to which all rows in a group would be 
> assigned. 
> The number of partition slot reserved for each group is decided also at 
> runtime by simple calculation of percentage. If a skew group is "CLUSTER BY 
> 20 PERCENT" and total partition slot (=number of reducer) is 20, that group 
> will reserve 4 partition slots, etc.
> "DISTRIBUTE BY" decides how the rows in a group is dispersed in the range of 
> reserved slots (If there is only one slot for a group, this is meaningless). 
> Currently, three distribution policies are available: RANDOM, KEYS, 
> <expression>. 
> 1. RANDOM : rows of driver** alias are dispersed by random and rows of 
> non-driver alias are duplicated for all the slots (default if not specified)
> 2. KEYS : determined by hash value of keys (same with previous)
> 3. expression : determined by hash of object evaluated by user-provided 
> expression
> Only possible with inner, equi, common-joins. Not yet supports join tree 
> merging.
> Might be used by other RS users like "SORT BY" or "GROUP BY"
> If there exists column statistics for the key, it could be possible to apply 
> automatically.
> For example, if 20 reducers are used for the query below,
> {code}
> select count(*) from src a join src b on a.key=b.key skew on (
>    a.key = '0' CLUSTER BY 10 PERCENT,
>    b.key < '100' CLUSTER BY 20 PERCENT DISTRIBUTE BY upper(b.key),
>    cast(a.key as int) > 300 CLUSTER BY 40 PERCENT DISTRIBUTE BY KEYS);
> {code}
> group-0 will reserve slots 6~7, group-1 8~11, group-2 12~19 and others will 
> reserve slots 0~5.
> For a row with key='0' from alias a, the row is randomly assigned in the 
> range of 6~7 (driver alias) : 6 or 7
> For a row with key='0' from alias b, the row is disributed for all slots in 
> 6~7 (non-driver alias) : 6 and 7
> For a row with key='50', the row is assigned in the range of 8~11 by hashcode 
> of upper(b.key) : 8 + (hash(upper(key)) % 4)
> For a row with key='500', the row is assigned in the range of 12~19 by 
> hashcode of join key : 12 + (hash(key) % 8)
> For a row with key='200', this is not belong to any skew group : hash(key) % 6
> *expressions in skew condition : 
> 1. all expressions should be made of expression in join condition, which 
> means if join condition is "a.key=b.key", user can make any expression with 
> "a.key" or "b.key". But if join condition is a.key+1=b.key, user cannot make 
> expression with "a.key" solely (should make expression with "a.key+1"). 
> 2. all expressions should reference one and only-one side of aliases. For 
> example, simple constant expressions or expressions referencing both side of 
> join condition ("a.key+b.key<100") is not allowed.
> 3. all functions in expression should be deteministic and stateless.
> 4. if "DISTRIBUTED BY expression" is used, distibution expression also should 
> have same alias with skew expression.
> **driver alias :
> 1. driver alias means the sole referenced alias from skew expression, which 
> is important for RANDOM distribution. rows of driver alias are assigned to 
> single slot randomly, but rows of non-driver alias are duplicated for all the 
> slots. So, driver alias should be the biggest one in join aliases.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Commented] (HIVE-3286) Explicit skew join on user provided condition

Reply via email to