Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by NamitJain:
http://wiki.apache.org/hadoop/Hive/Design

------------------------------------------------------------------------------
  == Compiler ==
   * Parser - Transform the query string into a parse tree representation.
   * Semantic Analyser - Transform the parse tree into an internal query 
representation, which is still block based and not an operator tree. As part of 
this step, the column names are verified and expansions like * are performed. 
Type-checking and any implicit type conversions are also performed at this 
stage. If the table under consideration is a partitioned table, which is the 
common scenario, all the expressions for that table are collected so that they 
can later be used to prune the partitions which are not needed (a rough sketch 
of this predicate collection appears after this list). If the query specifies 
sampling, that is also collected for later use.
   * Logical Plan Generator - Convert the internal query representation to a 
logical plan, which consists of a tree of operators. Some of the operators are 
relational algebra operators like 'filter', 'join' etc., but some are 
Hive-specific and are used later on to convert this plan into a series of 
map-reduce jobs. One such operator is the reduceSink operator, which occurs at 
the map-reduce boundary. This step also includes the optimizer, which 
transforms the plan to improve performance. Some of those transformations 
include: converting a series of joins into a single multi-way join, performing 
a map-side partial aggregation for a group-by, and performing a group-by in 
two stages to avoid the scenario where a single reducer becomes a bottleneck 
in the presence of data skewed on the grouping key. Each operator carries a 
descriptor, which is a serializable object (see the second sketch after this 
list).
   * Query Plan Generator - Convert the logical plan to a series of map-reduce 
tasks. The operator tree is recursively traversed and broken up into a series 
of serializable map-reduce tasks which can later be submitted to the 
map-reduce framework running over the Hadoop Distributed File System. The 
reduceSink operator marks the map-reduce boundary, and the reduction keys in 
its descriptor become the shuffle keys at that boundary. If the query 
specifies sampling or partition predicates, the plan includes only the 
required samples/partitions (see the third sketch after this list).
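
 The following is a minimal, hypothetical Java sketch of the partition-pruning 
bookkeeping described for the semantic analyser: WHERE-clause predicates that 
refer to partition columns are set aside so they can later be evaluated 
against partition metadata. All names here (PartitionPruningSketch, Predicate, 
collectPruningPredicates) are invented for illustration and are not Hive's 
actual classes.
 {{{
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: during semantic analysis, predicates that refer to
// partition columns are collected so that unneeded partitions can be pruned
// later. The names below are invented for illustration.
public class PartitionPruningSketch {

  // A trivially simplified predicate of the form "column op literal".
  static class Predicate {
    final String column;
    final String op;
    final String literal;

    Predicate(String column, String op, String literal) {
      this.column = column;
      this.op = op;
      this.literal = literal;
    }
  }

  // Keep only the predicates whose column is a partition column of the table;
  // these expressions can later be evaluated against partition metadata
  // instead of the table data itself.
  static List<Predicate> collectPruningPredicates(List<Predicate> wherePredicates,
                                                  List<String> partitionColumns) {
    List<Predicate> pruning = new ArrayList<>();
    for (Predicate p : wherePredicates) {
      if (partitionColumns.contains(p.column)) {
        pruning.add(p);
      }
    }
    return pruning;
  }
}
 }}}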
  
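 Below is a small illustrative sketch of the logical-plan idea: a tree of 
operators, each carrying a serializable descriptor, with a reduceSink operator 
whose descriptor records the reduction keys at the map-reduce boundary. The 
class names (Op, ReduceSinkDesc, LogicalPlanSketch) are simplified stand-ins, 
not Hive's real operator classes.
 {{{
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in for a logical-plan operator: a name, a serializable
// descriptor, and child operators forming a tree.
class Op {
  final String name;
  final Serializable descriptor;
  final List<Op> children = new ArrayList<>();

  Op(String name, Serializable descriptor) {
    this.name = name;
    this.descriptor = descriptor;
  }

  // Chain helper: add a child and return it so a simple pipeline reads top-down.
  Op then(Op child) {
    children.add(child);
    return child;
  }
}

// Descriptor of the reduceSink operator: it records the reduction keys that
// later become the shuffle keys at the map-reduce boundary.
class ReduceSinkDesc implements Serializable {
  final List<String> reductionKeys;

  ReduceSinkDesc(List<String> keys) {
    this.reductionKeys = keys;
  }
}

public class LogicalPlanSketch {
  public static void main(String[] args) {
    // Rough operator chain for: SELECT key, count(1) FROM src GROUP BY key
    // with a map-side partial aggregation before the shuffle and the final
    // aggregation after it.
    Op scan = new Op("TableScan[src]", null);
    scan.then(new Op("GroupBy[partial, map side]", null))
        .then(new Op("ReduceSink", new ReduceSinkDesc(Arrays.asList("key"))))
        .then(new Op("GroupBy[final, reduce side]", null))
        .then(new Op("FileSink", null));
  }
}
 }}}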
  
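 Finally, a toy continuation of the previous sketch showing how a chain-shaped 
operator tree could be cut into map-reduce tasks at each reduceSink boundary, 
with the reduction keys taken from the reduceSink descriptor. This is only an 
assumption-laden illustration reusing the Op and ReduceSinkDesc classes from 
the sketch above, not Hive's actual plan generator.
 {{{
import java.util.ArrayList;
import java.util.List;

// Illustrative only: cut a chain-shaped operator tree into map-reduce tasks at
// each reduceSink boundary. Assumes Op and ReduceSinkDesc from the previous
// sketch are compiled in the same package.
public class QueryPlanSketch {

  // One map-reduce task: operators before the reduceSink run on the map side,
  // operators after it run on the reduce side, and the reduction keys come
  // from the reduceSink descriptor.
  static class MapRedTask {
    final List<Op> mapSide = new ArrayList<>();
    final List<Op> reduceSide = new ArrayList<>();
    List<String> reductionKeys;
  }

  static List<MapRedTask> split(Op root) {
    List<MapRedTask> tasks = new ArrayList<>();
    MapRedTask first = new MapRedTask();
    tasks.add(first);
    walk(root, first, false, tasks);
    return tasks;
  }

  private static void walk(Op op, MapRedTask current, boolean onReduceSide,
                           List<MapRedTask> tasks) {
    if (op.descriptor instanceof ReduceSinkDesc) {
      if (onReduceSide) {
        // A second boundary: close the current task and start a new one that
        // reads the intermediate output and shuffles on the new keys.
        current = new MapRedTask();
        tasks.add(current);
      }
      current.reductionKeys = ((ReduceSinkDesc) op.descriptor).reductionKeys;
      for (Op child : op.children) {
        walk(child, current, true, tasks);
      }
      return;
    }
    (onReduceSide ? current.reduceSide : current.mapSide).add(op);
    for (Op child : op.children) {
      walk(child, current, onReduceSide, tasks);
    }
  }
}
 }}}
 Applied to the chain built in the previous sketch, split would produce a 
single task with the table scan and partial group-by on the map side, 'key' as 
the reduction key, and the final group-by and file sink on the reduce side.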
