Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by RaghothamMurthy:
http://wiki.apache.org/hadoop/Hive/Design

------------------------------------------------------------------------------
  [[TableOfContents]]
  == Hive Architecture ==
- Figure \ref{fig:sys_arch} shows the major components of Hive and its 
interactions with Hadoop. As shown in that figure, the main components of Hive 
are: 
+ Figure 1 shows the major components of Hive and its interactions with Hadoop. 
As shown in that figure, the main components of Hive are: 
   * UI - The user interface for users to submit queries and other operations to the system. Currently the system has a command line interface; a web based GUI is being developed.
   * Driver - The component which receives the queries. This component implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces (see the sketch after this list).
   * Compiler - The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.
   * Metastore - The component that stores all the structure information of the various tables and partitions in the warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding hdfs files where the data is stored.
   * Execution Engine - The component which executes the execution plan created 
by the compiler. The plan is a DAG of stages. The execution engine manages the 
dependencies between these different stages of the plan and executes these 
stages on the appropriate system components.
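
  The Driver's session handle and execute/fetch API can be pictured with a minimal Java sketch. The interface and method names below (HiveSession, QueryHandle, fetch) are illustrative assumptions rather than the actual Hive classes; the sketch only shows the JDBC/ODBC-style call pattern described above.
  
  {{{#!java
  // Hypothetical sketch only: these names are illustrative assumptions,
  // not the actual Hive Driver classes.
  import java.util.List;
  
  // A session handle groups the state of one client connection.
  interface HiveSession {
      // Compile and run a query; returns a handle to its result set.
      QueryHandle execute(String queryString) throws HiveQueryException;
  
      // Fetch up to maxRows rows of the result, JDBC/ODBC-style.
      List<String> fetch(QueryHandle handle, int maxRows) throws HiveQueryException;
  
      // Release all resources held by this session.
      void close();
  }
  
  // Opaque handle returned by execute() and consumed by fetch().
  interface QueryHandle {
      String getQueryId();
  }
  
  // Illustrative checked exception for compile-time or runtime failures.
  class HiveQueryException extends Exception {
      HiveQueryException(String message) { super(message); }
  }
  }}}
  
  A client would typically open a session, call execute once per query, and then call fetch repeatedly until no rows remain, mirroring the Statement/ResultSet pattern of JDBC.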
  
- Figure \ref{fig:sys_arch} also shows how a typical query flows through the system. The UI calls the execute interface of the Driver (step 1 in Figure \ref{fig:sys_arch}). The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan (step 2). The compiler gets the necessary metadata from the metastore (steps 3 and 4). This metadata is used to typecheck the expressions in the query tree as well as to prune partitions based on query predicates. The plan generated by the compiler (step 5) is a DAG of stages, with each stage being either a map/reduce job, a metadata operation or an operation on hdfs. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducers). The execution engine submits these stages to the appropriate components (steps 6, 6.1, 6.2 and 6.3). In each task (mapper/reducer), the deserializer associated with the table or intermediate outputs is used to read the rows from hdfs files, and these are passed through the associated operator tree. Once the output is generated, it is written to a temporary hdfs file through the serializer (this happens in the mapper if the operation does not need a reduce). The temporary files are used to provide data to subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location. This scheme is used to ensure that dirty data is not read (file rename being an atomic operation in hdfs). For queries, the contents of the temporary file are read by the execution engine directly from hdfs as part of the fetch call from the Driver (steps 7, 8 and 9).
+ Figure 1 also shows how a typical query flows through the system. The UI calls the execute interface of the Driver (step 1 in Figure 1). The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan (step 2). The compiler gets the necessary metadata from the metastore (steps 3 and 4). This metadata is used to typecheck the expressions in the query tree as well as to prune partitions based on query predicates. The plan generated by the compiler (step 5) is a DAG of stages, with each stage being either a map/reduce job, a metadata operation or an operation on hdfs. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducers). The execution engine submits these stages to the appropriate components (steps 6, 6.1, 6.2 and 6.3). In each task (mapper/reducer), the deserializer associated with the table or intermediate outputs is used to read the rows from hdfs files, and these are passed through the associated operator tree. Once the output is generated, it is written to a temporary hdfs file through the serializer (this happens in the mapper if the operation does not need a reduce). The temporary files are used to provide data to subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location. This scheme is used to ensure that dirty data is not read (file rename being an atomic operation in hdfs). For queries, the contents of the temporary file are read by the execution engine directly from hdfs as part of the fetch call from the Driver (steps 7, 8 and 9).
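
  The move of the final temporary file into the table's location can be sketched with Hadoop's FileSystem API. This is a simplified illustration rather than Hive's actual code; the class and method names below are assumptions, and only the FileSystem.rename call (atomic in hdfs) reflects the real Hadoop API that the scheme relies on.
  
  {{{#!java
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  
  // Illustrative sketch of the temp-file-then-rename commit described above.
  public class CommitSketch {
      // Move the finished temporary output under the table's location so that
      // readers never observe partially written (dirty) data.
      public static void commit(Configuration conf, Path tmpOutput, Path tableLocation)
              throws IOException {
          FileSystem fs = FileSystem.get(conf);
          Path target = new Path(tableLocation, tmpOutput.getName());
          // rename is atomic in hdfs: readers see either the old state or the
          // complete new file, never a half-written one.
          if (!fs.rename(tmpOutput, target)) {
              throw new IOException("Could not move " + tmpOutput + " to " + target);
          }
      }
  }
  }}}
  
  Because the temporary file is fully written before the rename, a concurrent query sees the table either without the new file or with the complete file, which is exactly the dirty-read guarantee described above.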
  
  
  == Hive Data Model ==
