Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by JeffHammerbacher:
http://wiki.apache.org/hadoop/Hive/Design
------------------------------------------------------------------------------
 * Tables - These are analogous to tables in relational databases. Tables can be filtered, projected, joined and unioned. Additionally, all the data of a table is stored in a directory in HDFS. Hive also supports the notion of external tables, wherein a table can be created on preexisting files or directories in HDFS by providing the appropriate location to the table creation DDL. The rows in a table are organized into typed columns, similar to relational databases.
 * Partitions - Each table can have one or more partition keys which determine how the data is stored, e.g. a table T with a date partition column ds has files with data for a particular date stored in the <table location>/ds=<date> directory in HDFS. Partitions allow the system to prune the data to be inspected based on query predicates, e.g. a query that is interested in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files in the <table location>/ds=2008-09-01/ directory in HDFS.
 * Buckets - Data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of data (these are queries that use the SAMPLE clause on the table).
- \end{itemize}

Apart from primitive column types (integers, floating point numbers, generic strings, dates and booleans), Hive also supports arrays and maps. Additionally, users can compose their own types programmatically from any of the primitives, collections or other user-defined types. The typing system is closely tied to the SerDe (Serialization/Deserialization) and object inspector interfaces.
Users can create their own types by implementing their own object inspectors, and using these object inspectors they can create their own SerDes to serialize and deserialize their data into HDFS files. These two interfaces provide the necessary hooks to extend the capabilities of Hive when it comes to understanding other data formats and richer types. Built-in object inspectors like ListObjectInspector, StructObjectInspector and MapObjectInspector provide the necessary primitives to compose richer types in an extensible manner. For maps (associative arrays) and arrays, useful built-in functions such as size and index operators are provided. The dotted notation is used to navigate nested types, e.g. a.b.c = 1 looks at field c of field b of type a and compares it with 1.
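
To make the storage layout and the dotted notation described above more concrete, here is a minimal Python sketch. It is not Hive code: the function names (partition_dir, prune_partitions, bucket_for, resolve), the /warehouse/T location and the use of CRC32 as the hash are all assumptions chosen for illustration, and Hive's actual hashing and pruning logic may well differ.

```python
import zlib


def partition_dir(table_location, key, value):
    """Directory for one partition, following the <table location>/ds=<date> pattern."""
    return f"{table_location.rstrip('/')}/{key}={value}"


def prune_partitions(table_location, key, values, wanted):
    """Keep only the partition directories that can satisfy an equality predicate."""
    return [partition_dir(table_location, key, v) for v in values if v == wanted]


def bucket_for(column_value, num_buckets):
    """Pick a bucket by hashing one column value.

    CRC32 is a stand-in hash for this sketch, not the function Hive uses.
    """
    return zlib.crc32(str(column_value).encode("utf-8")) % num_buckets


def resolve(record, dotted_path):
    """Walk nested dicts along a dotted path such as 'a.b.c'."""
    value = record
    for field in dotted_path.split("."):
        value = value[field]
    return value


# Hypothetical table T partitioned by ds, stored under /warehouse/T.
dates = ["2008-08-31", "2008-09-01", "2008-09-02"]
print(prune_partitions("/warehouse/T", "ds", dates, "2008-09-01"))
# ['/warehouse/T/ds=2008-09-01']

# The predicate a.b.c = 1 evaluated against one nested row.
row = {"a": {"b": {"c": 1}}}
print(resolve(row, "a.b.c") == 1)
# True
```

Only one of the three partition directories survives pruning, which is the saving the Partitions item describes. Because bucket_for is deterministic, rows with the same column value always land in the same bucket file, so a sampling query could read a subset of bucket files instead of the whole partition.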
