Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by RaghothamMurthy:
http://wiki.apache.org/hadoop/Hive/Design

------------------------------------------------------------------------------
+ [[TableOfContents]]
  == Hive Architecture ==
  Figure \ref{fig:sys_arch} shows the major components of Hive and its 
interactions with Hadoop. As shown in that figure, the main components of Hive 
are: 
   * UI - The user interface for users to submit queries and other operations 
to the system. Currently the system has a command line interface, and a 
web-based GUI is being developed.
@@ -20, +21 @@

  
  Apart from primitive column types (integers, floating point numbers, generic 
strings, dates and booleans), Hive also supports arrays and maps. Additionally, 
users can compose their own types programmatically from any of the primitives, 
collections or other user-defined types. The typing system is closely tied to 
the SerDe (Serialization/Deserialization) and object inspector interfaces. Users 
can create their own types by implementing their own object inspectors, and 
using these object inspectors they can create their own SerDes to serialize and 
deserialize their data into HDFS files. These two interfaces provide the 
necessary hooks to extend the capabilities of Hive when it comes to 
understanding other data formats and richer types. Built-in object inspectors 
like ListObjectInspector, StructObjectInspector and MapObjectInspector provide 
the necessary primitives to compose richer types in an extensible manner. For 
maps (associative arrays) and arrays, useful built-in functions like 
size and index operators are provided. The dotted notation is used to navigate 
nested types, e.g. a.b.c = 1 looks at field c of field b of type a and compares 
that with 1.
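As a rough illustration of the dotted notation and the size and index operators, the following Python sketch resolves a path such as a.b.c against a nested value. It is not Hive's object inspector API: `get_field` is a hypothetical helper, and plain dicts and lists stand in for struct, map and array values.

```python
from functools import reduce
from typing import Any, Mapping


def get_field(row: Mapping[str, Any], path: str) -> Any:
    """Resolve a dotted path such as "a.b.c" against nested mappings."""
    return reduce(lambda value, name: value[name], path.split("."), row)


# A row whose column `a` is a struct holding a struct `b`, an array and a map.
# The field names `tags` and `props` are made up for this example.
row = {
    "a": {
        "b": {"c": 1},
        "tags": ["x", "y", "z"],
        "props": {"k": "v"},
    }
}

# Equivalent in spirit to the predicate a.b.c = 1.
matches = get_field(row, "a.b.c") == 1

# Size and index operators on an array and a map.
tag_count = len(get_field(row, "a.tags"))
first_tag = get_field(row, "a.tags")[0]
prop_value = get_field(row, "a.props")["k"]
```

Each path segment selects one field of the enclosing struct, which is the same left-to-right navigation that a.b.c describes.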
  
- == Meta Store ==
+ == Metastore ==
  === Motivation ===
  The Metastore provides two important but often overlooked features of a 
data warehouse: data abstraction and data discovery. Without the data 
abstractions provided in Hive, a user has to provide information about data 
formats, extractors and loaders along with the query. In Hive, this information 
is given during table creation and reused every time the table is referenced. 
This is very similar to traditional warehousing systems. The second 
functionality, data discovery, enables users to discover and explore relevant 
and specific data in the warehouse. Other tools can be built using this 
metadata to expose and possibly enhance the information about the data and its 
availability. Hive accomplishes both of these features by providing a metadata 
repository that is tightly integrated with the Hive query processing system so 
that data and metadata are in sync.
  
@@ -29, +30 @@

   * Table - Metadata for a table contains the list of columns, owner, storage 
and SerDe information. It can also contain any user-supplied key and value data. 
Storage information includes the location of the underlying data, file input and 
output formats and bucketing information. SerDe metadata includes the 
implementation class of the serializer and deserializer and any supporting 
information required by the implementation. All of this information can be 
provided during the creation of the table.
   * Partition - Each partition can have its own columns and SerDe and storage 
information. This facilitates schema changes without affecting older partitions.
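The relationship between table and partition metadata can be sketched in Python as follows. This is an illustrative model only, not the metastore's actual schema: the class names, field names and format strings are assumptions, and bucketing information is left out for brevity. The point it shows is that a partition carries its own copy of the columns and storage details, so changing the table schema later leaves older partitions untouched.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, type name)


@dataclass
class StorageInfo:
    location: str
    input_format: str
    output_format: str
    serde_class: str
    columns: List[Column]


@dataclass
class Partition:
    values: Dict[str, str]
    storage: StorageInfo


@dataclass
class Table:
    name: str
    owner: str
    storage: StorageInfo
    parameters: Dict[str, str] = field(default_factory=dict)  # user key/value data
    partitions: List[Partition] = field(default_factory=list)

    def add_partition(self, values: Dict[str, str], location: str) -> Partition:
        """Assumed behaviour: a new partition snapshots the table's current schema."""
        storage = StorageInfo(
            location=location,
            input_format=self.storage.input_format,
            output_format=self.storage.output_format,
            serde_class=self.storage.serde_class,
            columns=list(self.storage.columns),  # copy, not a shared reference
        )
        partition = Partition(values=dict(values), storage=storage)
        self.partitions.append(partition)
        return partition


# All names and paths below are placeholders.
table = Table(
    name="page_views",
    owner="someone",
    storage=StorageInfo("/warehouse/page_views", "text-in", "text-out",
                        "example.SerDe", [("user_id", "int")]),
)
old = table.add_partition({"ds": "day1"}, "/warehouse/page_views/ds=day1")

# A later schema change on the table only affects partitions added afterwards.
table.storage.columns.append(("referrer", "string"))
new = table.add_partition({"ds": "day2"}, "/warehouse/page_views/ds=day2")
```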
  
- === Meta Store Architecture ===
+ === Metastore Architecture ===
  The Metastore is an object store with a database or file backed store. The 
database backed store is implemented using an ORM solution\cite{jpox}. The prime 
motivation for storing this in a relational database is queriability of 
metadata. Some disadvantages of using a separate data store for metadata instead 
of using HDFS are synchronization and scalability issues. Additionally, there is 
no clear way to implement an object store on top of HDFS due to the lack of 
random updates to files. This, coupled with the advantages of queriability of a 
relational store, made our approach a sensible one.
  The Metastore can be configured to be used in a couple of ways: remote and 
embedded. In remote mode, the metastore is a Thrift\cite{thrift} service. This 
mode is useful for non-Java clients. In embedded mode, the Hive client directly 
connects to an underlying metastore using JDBC. This mode is useful because it 
avoids another system that needs to be maintained and monitored. Both of these 
modes can co-exist.
  
+ === Metastore Interface ===
+ The Metastore provides a Thrift interface\cite{msapi} to manipulate and query 
Hive metadata. Thrift provides bindings in many popular languages. Third party 
tools can use this interface to integrate Hive metadata into other business 
metadata repositories.
+ 
+ == Hive Query Language ==
+ HiveQL is an SQL-like query language for Hive. It mostly mimics SQL syntax 
for creation
+ of tables, loading data into tables and querying the tables. HiveQL also 
allows
+ users to embed their custom map-reduce scripts. These scripts can be written 
in any language
+ using a simple row-based streaming interface -- read rows from standard input 
and write out
+ rows to standard output. This flexibility comes at a cost of a performance 
hit caused by
+ converting rows from and to strings. However, we have seen that users do not 
mind this given
+ that they can implement their scripts in the language of their choice. 
Another feature
+ unique to HiveQL is multi-table insert. In this construct, users can perform 
multiple queries
+ on the same input data using a single HiveQL query. Hive optimizes these 
queries to share
+ the scan of the input data, thus increasing the throughput of these queries 
by several orders
+ of magnitude. We omit more details due to lack of space.  For a more complete
+ description of the HiveQL language see the [wiki:Self:Hive/LanguageManual 
language manual].
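A minimal sketch of a custom script for the row-based streaming interface, written in Python. The tab-separated row layout, the two-column input (user, url) and the function names are assumptions made for illustration. The text above only states that rows arrive on standard input and leave on standard output as strings.

```python
import io
from typing import Iterable, Iterator, TextIO


def transform(rows: Iterable[str]) -> Iterator[str]:
    """Map each input row (user, url) to an output row (user, domain)."""
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows rather than failing the whole job
        user, url = fields
        # "scheme://host/path" splits into ["scheme:", "", "host", "path"].
        domain = url.split("/")[2] if "://" in url else url
        yield f"{user}\t{domain}"


def run(stdin: TextIO, stdout: TextIO) -> None:
    """Read rows from stdin and write transformed rows to stdout."""
    for out_row in transform(stdin):
        stdout.write(out_row + "\n")


# In-memory demonstration; a real script would call run(sys.stdin, sys.stdout).
demo_out = io.StringIO()
run(io.StringIO("alice\thttp://example.com/a\nbad row\nbob\texample.org\n"), demo_out)
```

Every value crosses the process boundary as text, which is the string conversion cost mentioned above.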
+ 
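The idea behind sharing one scan across the queries of a multi-table insert can be sketched in Python. This is a conceptual model, not how Hive plans or executes queries: `multi_insert`, `scan_input`, the target names and the sample rows are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]
Query = Tuple[str, Callable[[Row], bool]]  # (target table, row predicate)


def multi_insert(rows: Iterable[Row], queries: List[Query]) -> Dict[str, List[Row]]:
    """Evaluate every query during a single pass over the input rows."""
    outputs: Dict[str, List[Row]] = {target: [] for target, _ in queries}
    for row in rows:  # the input is consumed exactly once
        for target, predicate in queries:
            if predicate(row):
                outputs[target].append(row)
    return outputs


scans = 0


def scan_input() -> Iterable[Row]:
    """Hypothetical input source that counts how often it is scanned."""
    global scans
    scans += 1
    yield {"user": "alice", "age": 17}
    yield {"user": "bob", "age": 42}
    yield {"user": "carol", "age": 65}


result = multi_insert(
    scan_input(),
    [
        ("minors", lambda r: r["age"] < 18),
        ("seniors", lambda r: r["age"] >= 65),
    ],
)
```

Running the two filters as separate queries would call `scan_input` twice; here both targets are filled while `scans` stays at 1.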
