The following page has been changed by NamitJain:
http://wiki.apache.org/hadoop/Hive/Design

------------------------------------------------------------------------------
    * Parser - Transform the query string to a parse tree representation.
    * Semantic Analyser - Transform the parse tree to an internal query representation, which is still block based and not an operator tree. As part of this step, the column names are verified and expansions such as * are performed. Type-checking and any implicit type conversions are also performed at this stage. If the table under consideration is a partitioned table, which is the common scenario, all the expressions for that table are collected so that they can be used later to prune the partitions that are not needed. If the query specifies sampling, that is also collected for later use.
    * Logical Plan Generator - Convert the internal query representation to a logical plan, which consists of a tree of operators. Some of the operators are relational algebra operators such as 'filter' and 'join'. Others are Hive specific and are used later to convert this plan into a series of map-reduce jobs. One such operator is the ReduceSink operator, which occurs at the map-reduce boundary. This step also includes the optimizer, which transforms the plan to improve performance. Some of those transformations are: converting a series of joins into a single multi-way join, performing a map-side partial aggregation for a group-by, and performing a group-by in 2 stages so that a single reducer does not become a bottleneck when the data is skewed on the grouping key. Each operator comprises a descriptor, which is a serializable object.
-   * Query Plan Generator - Convert the logical plan to a series of map-reduce tasks. The operator tree is traversed recursively and broken up into a series of serializable map-reduce tasks, which can be submitted later to the map-reduce framework for the Hadoop distributed file system. The ReduceSink operator is the map-reduce boundary, and its descriptor contains the reduction keys. The reduction keys in the ReduceSink descriptor are used as the reduction keys at the map-reduce boundary. The plan contains only the required samples/partitions if the query specified them.
+   * Query Plan Generator - Convert the logical plan to a series of map-reduce tasks. The operator tree is traversed recursively and broken up into a series of serializable map-reduce tasks, which can be submitted later to the map-reduce framework for the Hadoop distributed file system. The reduceSink operator is the map-reduce boundary, and its descriptor contains the reduction keys. The reduction keys in the reduceSink descriptor are used as the reduction keys at the map-reduce boundary. The plan contains only the required samples/partitions if the query specified them.

  == Optimizer ==
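One of the transformations listed for the Logical Plan Generator, performing a group-by in 2 stages to cope with a skewed grouping key, can be illustrated with a small simulation. This is only a sketch under stated assumptions and not Hive's implementation: the function names, the CRC32-based partitioner, and the round-robin spreading in the first stage are all hypothetical choices made for the example, and the aggregate is a simple count.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Hypothetical partitioner: deterministic hash of the grouping key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def group_by_one_stage(rows, num_reducers):
    """Count rows per key, sending every row straight to the reducer for its key.

    Returns (counts, loads), where loads[i] is the number of rows reducer i received.
    """
    reducers = [Counter() for _ in range(num_reducers)]
    loads = [0] * num_reducers
    for key in rows:
        r = partition(key, num_reducers)
        reducers[r][key] += 1
        loads[r] += 1
    counts = Counter()
    for partial in reducers:
        counts.update(partial)
    return counts, loads


def group_by_two_stage(rows, num_reducers):
    """Count rows per key in two stages.

    Stage 1 spreads rows round-robin, ignoring the key, and builds partial counts.
    Stage 2 partitions those partial counts by key and merges them.
    Returns (counts, stage1_loads, stage2_loads).
    """
    stage1 = [Counter() for _ in range(num_reducers)]
    stage1_loads = [0] * num_reducers
    for i, key in enumerate(rows):
        r = i % num_reducers  # assumption: round-robin stands in for random spreading
        stage1[r][key] += 1
        stage1_loads[r] += 1

    stage2 = [Counter() for _ in range(num_reducers)]
    stage2_loads = [0] * num_reducers
    for partial in stage1:
        for key, count in partial.items():
            r = partition(key, num_reducers)
            stage2[r][key] += count
            stage2_loads[r] += 1  # one partial record per (stage-1 reducer, key)

    counts = Counter()
    for merged in stage2:
        counts.update(merged)
    return counts, stage1_loads, stage2_loads


if __name__ == "__main__":
    # 90 of 100 rows share one grouping key.
    rows = ["hot"] * 90 + ["k%d" % i for i in range(10)]
    one_counts, one_loads = group_by_one_stage(rows, 4)
    two_counts, s1_loads, s2_loads = group_by_two_stage(rows, 4)
    print("one stage reducer loads:", one_loads)
    print("two stage loads, stage 1:", s1_loads, "stage 2:", s2_loads)
    print("same result:", one_counts == two_counts)
```

In the one-stage version, the reducer that owns the hot key receives at least 90 of the 100 rows. In the two-stage version, the rows are spread evenly in the first stage, and the second stage only has to merge a handful of partial counts per key.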
