[ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401939#comment-13401939 ]
alex gemini commented on HIVE-3086: ----------------------------------- for a big table logs(userid,region,timestamps,url) which has more than 10 billion record,a middle size table users(userid,age) which has 10 million records, if there is a query : select count(userid) from logs a ,users b where a.userid=b.userid group by b.age. let's say age 18-25 have more than 50% of total records and age 40-60 have only 5% of records, age 25-50 have rest. what we defined skewed is always by our query ,in this case skewed key is age,we can't always assume two table are skewed by join key,right? another example : select count(userid),to_date(timestamps,'YYYYMMDD'),age from logs where timestamps > 2011-12-01 and timestamps < 2011-12-31 and age<25 and age>18. because the Christmas,records in 2011-12-25 to 2011-12-31 maybe have more records than other day in this month(this query particular assume age is not skewed for the purpose discussion). since hive user hash partition ,let's say 6 reduce,then 2011-12-24 and 2011-12-30 will go into same reduce which cause one reduce process much more records than others. > Skewed Join Optimization > ------------------------ > > Key: HIVE-3086 > URL: https://issues.apache.org/jira/browse/HIVE-3086 > Project: Hive > Issue Type: New Feature > Reporter: Nadeem Moidu > Assignee: Nadeem Moidu > > During a join operation, if one of the columns has a skewed key, it can cause > that particular reducer to become the bottleneck. The following feature will > address it: > https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira