[Hadoop Wiki] Update of "Hive/Design" by JeffHammerbacher

Apache Wiki Wed, 21 Jan 2009 20:11:57 -0800

The following page has been changed by JeffHammerbacher:
http://wiki.apache.org/hadoop/Hive/Design

------------------------------------------------------------------------------
   * Tables - These are analogous to tables in relational databases. Tables can 
be filtered, projected, joined and unioned. All the data of a table is stored 
in a directory in HDFS. Hive also supports the notion of external tables, 
wherein a table can be created on preexisting files or directories in HDFS by 
providing the appropriate location to the table creation DDL. The rows in a 
table are organized into typed columns, as in relational databases.
   * Partitions - Each table can have one or more partition keys which 
determine how the data is stored, e.g. a table T with a date partition column 
ds has the files with data for a particular date stored in the <table 
location>/ds=<date> directory in HDFS. Partitions allow the system to prune 
the data to be inspected based on query predicates, e.g. a query that is 
interested in rows from T that satisfy the predicate T.ds = '2008-09-01' would 
only have to look at files in the <table location>/ds=2008-09-01/ directory in 
HDFS.
   * Buckets - Data in each partition may in turn be divided into buckets based 
on the hash of a column in the table. Each bucket is stored as a file in the 
partition directory. Bucketing allows the system to efficiently evaluate 
queries that depend on a sample of the data (these are queries that use the 
SAMPLE clause on the table).
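
The storage layout described above can be illustrated with a small Python 
sketch. It is not Hive code: the helper names (partition_dir, 
prune_partitions, bucket_for), the /warehouse/T location and the CRC32-based 
hash are all assumptions made for illustration, and Hive's actual hash 
function and file naming may differ.

```python
import zlib
from pathlib import PurePosixPath


def partition_dir(table_location, key, value):
    """Build the <table location>/<key>=<value> directory for one partition."""
    return str(PurePosixPath(table_location) / f"{key}={value}")


def prune_partitions(partition_dirs, key, wanted):
    """Keep only the partition directories matching the predicate key = wanted."""
    suffix = f"{key}={wanted}"
    return [d for d in partition_dirs if PurePosixPath(d).name == suffix]


def bucket_for(value, num_buckets):
    """Assign a column value to a bucket by hashing it (illustrative hash only)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Hypothetical table T stored under /warehouse/T and partitioned by ds.
all_dirs = [
    partition_dir("/warehouse/T", "ds", day)
    for day in ("2008-08-31", "2008-09-01", "2008-09-02")
]

# A query with the predicate T.ds = '2008-09-01' only needs one directory.
to_scan = prune_partitions(all_dirs, "ds", "2008-09-01")

# Within that partition, rows are spread over a fixed number of bucket files.
bucket = bucket_for("user42", 4)
```

Because the same value always hashes to the same bucket, a sampling query can 
read a subset of the bucket files instead of scanning the whole partition.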
  
  Apart from primitive column types (integers, floating point numbers, generic 
strings, dates and booleans), Hive also supports arrays and maps. Additionally, 
users can compose their own types programmatically from any of the primitives, 
collections or other user-defined types. The typing system is closely tied to 
the SerDe (Serialization/Deserialization) and object inspector interfaces. 
Users can create their own types by implementing their own object inspectors, 
and using these object inspectors they can create their own SerDes to 
serialize and deserialize their data into HDFS files. These two interfaces 
provide the necessary hooks to extend the capabilities of Hive when it comes 
to understanding other data formats and richer types. Built-in object 
inspectors like ListObjectInspector, StructObjectInspector and 
MapObjectInspector provide the necessary primitives to compose richer types in 
an extensible manner. For maps (associative arrays) and arrays, useful 
built-in functions like size and index operators are provided. The dotted 
notation is used to navigate nested types, e.g. a.b.c = 1 looks at field c of 
field b of column a and compares that with 1.
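
As a rough analogy for the dotted notation and the size and index operators, 
the following Python sketch models a row as nested dictionaries and lists. 
This is only an illustration: get_path and the sample row are hypothetical, 
and the sketch does not use or reproduce Hive's object inspector interfaces.

```python
def get_path(row, dotted):
    """Follow a dotted path such as 'a.b.c' through nested struct-like dicts."""
    value = row
    for field in dotted.split("."):
        value = value[field]
    return value


# Hypothetical row: a is a nested struct, tags is an array, props is a map.
row = {
    "a": {"b": {"c": 1}},
    "tags": ["x", "y"],
    "props": {"k": "v"},
}

matches = get_path(row, "a.b.c") == 1   # the predicate a.b.c = 1
tag_count = len(row["tags"])            # size of an array
first_tag = row["tags"][0]              # index into an array
prop_value = row["props"]["k"]          # index into a map by key
```

Each step of the path selects one field of the enclosing value, which is the 
same navigation the text describes for a.b.c.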
