[ https://issues.apache.org/jira/browse/HIVE-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717840#comment-13717840 ]
Edward Capriolo commented on HIVE-4660:
---------------------------------------

Thanks for uploading that. I am still getting up to speed a bit, so a silly question: I am looking through the Tez source code and attempting to understand its basic optimizations. I am looking at GroupByOrderByMRRTest.

/**
 * Simple example that does a GROUP BY ORDER BY in an MRR job
 * Consider a query such as
 * Select DeptName, COUNT(*) as cnt FROM EmployeeTable
 * GROUP BY DeptName ORDER BY cnt;

I notice that this test essentially runs the job with a single reducer.

job.setNumReduceTasks(1);

/**
 * Shuffle ensures ordering based on count of employees per department
 * hence the final reducer is a no-op and just emits the department name
 * with the employee count per department.
 */

What mechanism makes the above optimization happen? Do all shuffles have a natural total order sort with Tez?

> Let there be Tez
> ----------------
>
>                 Key: HIVE-4660
>                 URL: https://issues.apache.org/jira/browse/HIVE-4660
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Gunther Hagleitner
>            Assignee: Gunther Hagleitner
>
> Tez is a new application framework built on Hadoop YARN that can execute
> complex directed acyclic graphs of general data processing tasks. Here's the
> project's page: http://incubator.apache.org/projects/tez.html
> The interesting thing about Tez from Hive's perspective is that it will over
> time allow us to overcome inefficiencies in query processing due to having to
> express every algorithm in the map-reduce paradigm.
> The barrier to entry is pretty low as well: Tez can actually run unmodified
> MR jobs; but as a first step we can without much trouble start using more of
> Tez' features by taking advantage of the MRR pattern.
> MRR simply means that there can be any number of reduce stages following a
> single map stage - without having to write intermediate results to HDFS and
> re-read them in a new job.
> This is common when queries require multiple
> shuffles on keys without correlation (e.g.: join - grp by - window function -
> order by)
> For more details see the design doc here:
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
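
For readers following the GROUP BY ... ORDER BY question in the comment above, here is a rough in-memory sketch of the map - reduce - reduce flow. It is not Tez or Hive code: the class MrrSketch and all of its methods are hypothetical names made up for illustration, and a TreeMap stands in for the shuffle's sort-by-key. In classic MapReduce each reducer receives its keys in sorted order, so with exactly one reducer the sorted input amounts to a total order. The sketch assumes the second shuffle in the test behaves the same way and is keyed on the count; whether that is actually how Tez does it is the open question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: none of these names come from Tez or Hive.
class MrrSketch {

    // Stage 1 (map + first reduce): group employee rows by department and count them.
    public static Map<String, Integer> countByDept(List<String> deptPerEmployee) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String dept : deptPerEmployee) {
            counts.merge(dept, 1, Integer::sum);
        }
        return counts;
    }

    // Second shuffle (assumed): re-key each record by its count.
    // The TreeMap plays the role of the shuffle sorting records by key.
    public static TreeMap<Integer, List<String>> shuffleByCount(Map<String, Integer> counts) {
        TreeMap<Integer, List<String>> sorted = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sorted.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return sorted;
    }

    // Stage 2 (final reduce): a single reducer sees every key in sorted order,
    // so it does no sorting of its own and only emits "dept<TAB>count".
    public static List<String> finalReduce(TreeMap<Integer, List<String>> sortedByCount) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : sortedByCount.entrySet()) {
            for (String dept : e.getValue()) {
                out.add(dept + "\t" + e.getKey());
            }
        }
        return out;
    }

    // The whole pipeline, with no intermediate write between the two reduce stages.
    public static List<String> groupByOrderBy(List<String> deptPerEmployee) {
        return finalReduce(shuffleByCount(countByDept(deptPerEmployee)));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("eng", "ops", "eng", "hr", "eng", "ops");
        for (String line : groupByOrderBy(rows)) {
            System.out.println(line);
        }
    }
}
```

With more than one final reducer, each reducer's output would only be sorted within its own partition, which is presumably why the test pins the reducer count to 1.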