[ 
https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401939#comment-13401939
 ] 

alex gemini commented on HIVE-3086:
-----------------------------------

for a big table logs(userid,region,timestamps,url) which has more than 10 
billion record,a middle size table users(userid,age) which has 10 million 
records, if there is a query :
 select count(userid) from logs a ,users b where a.userid=b.userid group by 
b.age.
let's say age 18-25 have more than 50% of total records and age 40-60 have only 
5% of records, age 25-50 have rest.
what we defined skewed is always by our query ,in this case skewed key is 
age,we can't always assume two table are skewed by join key,right?
another example : select count(userid),to_date(timestamps,'YYYYMMDD'),age from 
logs where timestamps > 2011-12-01 and timestamps < 2011-12-31 and age<25 and 
age>18.
because the Christmas,records in 2011-12-25 to 2011-12-31 maybe have more 
records than other day in this month(this query particular assume age is not 
skewed for the purpose discussion).
since hive user hash partition ,let's say 6 reduce,then 2011-12-24 and 
2011-12-30 will go into same reduce which cause one reduce process much more 
records than others.
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Nadeem Moidu
>
> During a join operation, if one of the columns has a skewed key, it can cause 
> that particular reducer to become the bottleneck. The following feature will 
> address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to