[Hadoop Wiki] Update of "Hive/DeveloperGuide" by ZhengShao

Apache Wiki Tue, 16 Dec 2008 14:52:40 -0800

The following page has been changed by ZhengShao:
http://wiki.apache.org/hadoop/Hive/DeveloperGuide

------------------------------------------------------------------------------
   * trunk/testutils (Deprecated)
  
  === SerDe ===
+ What is SerDe
+   * SerDe is a short name for Serializer and Deserializer.
+   * Hive uses SerDe (and FileFormat) to read from/write to tables.
+   * HDFS files --(InputFileFormat)--> <key, value> --(Deserializer)--> Row 
object
+   * Row object --(Serializer)--> <key, value> --(OutputFileFormat)--> HDFS 
files
+ 
+ Note that the "key" part is ignored when reading, and is always a constant 
when writing. In other words, the row object is stored only in the "value".
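
The read and write paths above can be sketched in plain Java. This is only an
illustration under stated assumptions: the class name KeyValueRowSketch, the
control-A field delimiter, and the empty-string constant key are all
hypothetical, and the methods are toy stand-ins rather than Hive's actual
Serializer/Deserializer interfaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for a SerDe. NOT Hive's real Serializer/Deserializer API.
class KeyValueRowSketch {
    // Assumption: fields inside the value are separated by control-A.
    static final String DELIMITER = "\u0001";
    // Assumption: the constant key emitted on every write is the empty string.
    static final String CONSTANT_KEY = "";

    // Read path: <key, value> -> row. The key is accepted but never used.
    static List<String> deserialize(String ignoredKey, String value) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        return new ArrayList<>(Arrays.asList(value.split(DELIMITER, -1)));
    }

    // Write path: row -> <key, value>, returned as a two-element array.
    static String[] serialize(List<String> row) {
        return new String[] { CONSTANT_KEY, String.join(DELIMITER, row) };
    }

    public static void main(String[] args) {
        // Two different keys with the same value yield the same row.
        List<String> a = deserialize("0", "alice\u000130\u0001NYC");
        List<String> b = deserialize("999", "alice\u000130\u0001NYC");
        System.out.println(a.equals(b));

        // Writing the row back stores all of it in the value.
        String[] kv = serialize(a);
        System.out.println(kv[1].equals("alice\u000130\u0001NYC"));
    }
}
```

Because the key carries no row data, round-tripping a value through
deserialize and serialize reproduces the original value regardless of the key
that accompanied it on the read side.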
+ 
+ One principle of Hive is that Hive does not own the HDFS file format. Users 
should be able to read the HDFS files in Hive tables directly using other 
tools, or use other tools to write HDFS files directly that Hive can then read 
through "CREATE EXTERNAL TABLE" or load through "LOAD DATA INPATH", which just 
moves the file into the Hive table directory.
+ 
+ Note that org.apache.hadoop.hive.serde is the deprecated old serde library. 
Please look at org.apache.hadoop.hive.serde2 for the latest version.
+ 
+ Existing FileFormats and SerDe classes
+   * TextInputFormat/NoKeyTextOutputFormat: These 2 classes read/write data in 
plain text file format.
+   * SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes 
read/write data in Hadoop SequenceFile format.
+ 
+ Hive currently uses these SerDe classes to serialize and deserialize data:
+   * MetadataTypedColumnsetSerDe: This serde is used to read/write delimited 
records such as CSV, tab-separated, or control-A separated records (quoting is 
not supported yet).
+   * ThriftSerDe: This serde is used to read/write thrift serialized objects.  
The class file for the Thrift object must be loaded first.
+   * DynamicSerDe: This serde also reads/writes thrift serialized objects, but 
it understands thrift DDL, so the schema of the object can be provided at 
runtime.  It also supports several different protocols, including 
TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in 
delimited records).
+ 
+ How to write your own SerDe:
+   * In most cases, users want to write a Deserializer instead of a SerDe.
+   * For example, the RegexDeserializer will deserialize the data using the 
configuration parameter 'regex', and possibly a list of column names (see 
serde2.MetadataTypedColumnsetSerDe). Please see serde2/Deserializer.java for 
details.
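
As a rough illustration of the regex-driven idea above, here is a minimal
sketch in plain Java. It does not implement the serde2 Deserializer interface
(see serde2/Deserializer.java for the real contract); the class name
RegexDeserializerSketch, its constructor arguments, and the choice to return a
Map or null are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the actual Hive RegexDeserializer.
class RegexDeserializerSketch {
    private final Pattern pattern;
    private final List<String> columnNames;

    // 'regex' plays the role of the 'regex' configuration parameter;
    // each capturing group supplies one column, in order.
    RegexDeserializerSketch(String regex, List<String> columnNames) {
        this.pattern = Pattern.compile(regex);
        this.columnNames = new ArrayList<>(columnNames);
    }

    // Turns one serialized value into a row (column name -> text).
    // Returns null when the value does not match the pattern or the
    // pattern has fewer groups than there are columns (an assumption;
    // a real deserializer might throw or emit NULL columns instead).
    Map<String, String> deserialize(String value) {
        Matcher m = pattern.matcher(value);
        if (!m.matches() || m.groupCount() < columnNames.size()) {
            return null;
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columnNames.size(); i++) {
            row.put(columnNames.get(i), m.group(i + 1)); // groups are 1-based
        }
        return row;
    }

    public static void main(String[] args) {
        RegexDeserializerSketch d = new RegexDeserializerSketch(
                "(\\S+) (\\S+) (\\d+)",
                Arrays.asList("host", "method", "status"));
        System.out.println(d.deserialize("10.0.0.1 GET 200"));
        System.out.println(d.deserialize("malformed line"));
    }
}
```

The sketch only covers the read side, which matches the point that most users
need a Deserializer rather than a full SerDe.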
+ 
  === MetaStore ===
+ 
  === Query Processor ===
  The following are the main components of the Hive Query Processor:
   * Parse and SemanticAnalysis (ql/parse) - This component contains the code 
for parsing SQL, converting it into Abstract Syntax Trees, converting the 
Abstract Syntax Trees into Operator Plans and finally converting the operator 
plans into a directed graph of tasks which are executed by Driver.java.
