Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by ZhengShao:
http://wiki.apache.org/hadoop/Hive/DeveloperGuide

------------------------------------------------------------------------------
   * trunk/testutils (Deprecated)

  === SerDe ===
- What is SerDe
+ What is !SerDe
-  * SerDe is a short name for Serializer and Deserializer.
+  * !SerDe is a short name for Serializer and Deserializer.
-  * Hive uses SerDe (and FileFormat) to read from/write to tables.
+  * Hive uses !SerDe (and !FileFormat) to read from/write to tables.
-  * HDFS files --(InputFileFormat)--> <key, value> --(Deserializer)--> Row object
+  * HDFS files --(!InputFileFormat)--> <key, value> --(Deserializer)--> Row object
-  * Row object --(Serializer)--> <key, value> --(OutputFileFormat)--> HDFS files
+  * Row object --(Serializer)--> <key, value> --(!OutputFileFormat)--> HDFS files

  Note that the "key" part is ignored when reading, and is always a constant when writing. Basically the row object is only stored in the "value".

@@ -40, +40 @@

  Note that org.apache.hadoop.hive.serde is the deprecated old SerDe library. Please look at org.apache.hadoop.hive.serde2 for the latest version.

  Hive currently uses these FileFormat classes to read and write HDFS files:
-  * TextInputFormat/NoKeyTextOutputFormat: These 2 classes read/write data in plain text file format.
+  * !TextInputFormat/NoKeyTextOutputFormat: These two classes read/write data in plain text file format.
-  * SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in hadoop SequenceFile format.
+  * !SequenceFileInputFormat/SequenceFileOutputFormat: These two classes read/write data in Hadoop !SequenceFile format.

- Hive currently use these SerDe classes to serialize and deserialize data:
+ Hive currently uses these !SerDe classes to serialize and deserialize data:
-  * MetadataTypedColumnsetSerDe: This serde is used to read/write delimited records like CSV, tab-separated control-A separated records (sorry, quote is not supported yet.)
+  * !MetadataTypedColumnsetSerDe: This !SerDe is used to read/write delimited records such as CSV, tab-separated, or control-A-separated records (sorry, quoting is not supported yet).
-  * ThriftSerDe: This serde is used to read/write thrift serialized objects. The class file for the Thrift object must be loaded first.
+  * !ThriftSerDe: This !SerDe is used to read/write Thrift-serialized objects. The class file for the Thrift object must be loaded first.
-  * DynamicSerDe: This serde also read/write thrift serialized objects, but it understands thrift DDL so the schema of the object can be provided at runtime. Also it supports a lot of different protocols, including TBinaryProtocol, TJSONProtocol, TCTLSeparatedProtocol (which writes data in delimited records).
+  * !DynamicSerDe: This !SerDe also reads/writes Thrift-serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports many different protocols, including !TBinaryProtocol, !TJSONProtocol, and !TCTLSeparatedProtocol (which writes data in delimited records).

- How to write your own SerDe:
+ How to write your own !SerDe:
-  * In most cases, users want to write a Deserializer instead of a SerDe, because users just want to read their own data format instead of writing to it.
+  * In most cases, users want to write a Deserializer rather than a full !SerDe, because they only need to read their own data format, not write it (see the sketch after this list).
-  * For example, the RegexDeserializer will deserialize the data using the configuration parameter 'regex', and possibly a list of column names (see serde2.MetadataTypedColumnsetSerDe). Please see serde2/Deserializer.java for details.
+  * For example, the !RegexDeserializer will deserialize the data using the configuration parameter 'regex', and possibly a list of column names (see serde2.MetadataTypedColumnsetSerDe). See serde2/Deserializer.java for details.
-  * If your SerDe supports DDL (basically, SerDe with parameterized columns and column types), you probably want to implement a Protocol based on DynamicSerDe, instead of writing a SerDe from scratch. The reason is that the framework passes DDL to SerDe through "thrift DDL" format, and it's non-trivial to write a "thrift DDL" parser.
+  * If your !SerDe supports DDL (basically, a !SerDe with parameterized columns and column types), you probably want to implement a Protocol based on !DynamicSerDe instead of writing a !SerDe from scratch, because the framework passes the DDL to the !SerDe in "Thrift DDL" format, and writing a "Thrift DDL" parser is non-trivial.
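As a rough illustration of the Deserializer contract described in the list above, here is a minimal sketch of a custom Deserializer that reads pipe-delimited text and exposes every column as a string. It is not from the wiki page: the package, class name, and the "columns" property key are assumptions, and it is written against the early serde2 interface (initialize / deserialize / getObjectInspector), so factory and interface signatures may differ between Hive versions.

{{{
// Hypothetical example, not part of Hive. Written against the early serde2
// Deserializer contract (initialize/deserialize/getObjectInspector); later
// Hive versions add more methods and factory names may differ.
package org.example.hive;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/** Reads pipe-delimited text records and exposes every column as a string. */
public class PipeDelimitedDeserializer implements Deserializer {

  private int numColumns;
  private ObjectInspector rowInspector;
  private final List<String> row = new ArrayList<String>();   // reused per record

  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // The table's column names arrive through the table properties
    // ("columns" is assumed here to be the property key).
    List<String> columnNames =
        Arrays.asList(tbl.getProperty("columns", "col").split(","));
    numColumns = columnNames.size();

    // Every column is presented as a Java string in this sketch.
    List<ObjectInspector> columnInspectors = new ArrayList<ObjectInspector>();
    for (int i = 0; i < numColumns; i++) {
      columnInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    }
    rowInspector = ObjectInspectorFactory.getStandardStructObjectInspector(
        columnNames, columnInspectors);
  }

  public Object deserialize(Writable blob) throws SerDeException {
    // Only the "value" half of the <key, value> pair reaches the Deserializer.
    String[] fields = ((Text) blob).toString().split("\\|", -1);
    row.clear();
    for (int i = 0; i < numColumns; i++) {
      row.add(i < fields.length ? fields[i] : null);   // missing columns become NULL
    }
    return row;   // a standard java.util.List represents the Struct row
  }

  public ObjectInspector getObjectInspector() throws SerDeException {
    return rowInspector;
  }
}
}}}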
- Some important points of SerDe:
+ Some important points about !SerDe:
-  * SerDe, not the DDL, defines the table schema. Some SerDe implementations use the DDL for configuration, but SerDe can also override that.
+  * The !SerDe, not the DDL, defines the table schema. Some !SerDe implementations use the DDL for configuration, but the !SerDe can also override that.
   * Column types can be arbitrarily nested arrays, maps, and structures.
-  * The callback design of ObjectInspector allows lazy deserialization with CASE/IF or when using complex or nested types.
+  * The callback design of !ObjectInspector allows lazy deserialization with CASE/IF or when using complex or nested types.
+
+ ==== ObjectInspector ====
+ Hive uses !ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
+
+ !ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in memory, including:
+  * An instance of a Java class (Thrift or native Java)
+  * A standard Java object (we use java.util.List to represent Struct and Array, and java.util.Map to represent Map)
+  * A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with a starting offset for each field)
+
+ A complex object can be represented by a pair of !ObjectInspector and Java Object.
+ The !ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.
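To make the uniform-access idea concrete, here is a small sketch (again, not from the wiki page) that builds a standard struct !ObjectInspector for two string columns and reads a row's fields through it. The row object here happens to be a java.util.List, but the reading code would stay the same for a Thrift or lazily-initialized row as long as a matching !ObjectInspector is supplied. Class and field names are made up, and the inspector factory names may vary across Hive versions.

{{{
// Hypothetical demo of generic field access through ObjectInspector.
import java.util.Arrays;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ObjectInspectorDemo {
  public static void main(String[] args) {
    // Build an inspector describing a struct<name:string, city:string>.
    StructObjectInspector rowOI =
        ObjectInspectorFactory.getStandardStructObjectInspector(
            Arrays.asList("name", "city"),
            Arrays.<ObjectInspector>asList(
                PrimitiveObjectInspectorFactory.javaStringObjectInspector,
                PrimitiveObjectInspectorFactory.javaStringObjectInspector));

    // The "standard" representation of a Struct is a java.util.List, as
    // described above; a Thrift or lazy row would just pair the same kind
    // of access code with a different ObjectInspector.
    Object row = Arrays.asList("alice", "stockholm");

    // Generic field access: the caller never touches the row's concrete class.
    for (StructField field : rowOI.getAllStructFieldRefs()) {
      Object value = rowOI.getStructFieldData(row, field);
      System.out.println(field.getFieldName() + " = " + value);
    }
  }
}
}}}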
  === MetaStore ===