Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by ZhengShao:
http://wiki.apache.org/hadoop/Hive/DeveloperGuide

------------------------------------------------------------------------------
   * trunk/testutils (Deprecated)

  === SerDe ===
- What is SerDe
+ What is !SerDe
-  * SerDe is a short name for Serializer and Deserializer.
+  * !SerDe is a short name for Serializer and Deserializer.
-  * Hive uses SerDe (and FileFormat) to read from/write to tables.
+  * Hive uses !SerDe (and !FileFormat) to read from/write to tables.
-  * HDFS files --(InputFileFormat)--> <key, value> --(Deserializer)--> Row object
+  * HDFS files --(!InputFileFormat)--> <key, value> --(Deserializer)--> Row object
-  * Row object --(Serializer)--> <key, value> --(OutputFileFormat)--> HDFS files
+  * Row object --(Serializer)--> <key, value> --(!OutputFileFormat)--> HDFS files

  Note that the "key" part is ignored when reading, and is always a constant when writing. Basically the row object is only stored in the "value".

@@ -40, +40 @@

  Note that org.apache.hadoop.hive.serde is the deprecated old SerDe library. Please look at org.apache.hadoop.hive.serde2 for the latest version.

  Hive currently uses these FileFormat classes to read and write HDFS files:
-  * TextInputFormat/NoKeyTextOutputFormat: These 2 classes read/write data in plain text file format.
+  * !TextInputFormat/NoKeyTextOutputFormat: These two classes read/write data in plain text file format.
-  * SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in hadoop SequenceFile format.
+  * !SequenceFileInputFormat/SequenceFileOutputFormat: These two classes read/write data in Hadoop !SequenceFile format.

- Hive currently use these SerDe classes to serialize and deserialize data:
+ Hive currently uses these !SerDe classes to serialize and deserialize data:
-  * MetadataTypedColumnsetSerDe: This serde is used to read/write delimited records like CSV, tab-separated control-A separated records (sorry, quote is not supported yet.)
+  * !MetadataTypedColumnsetSerDe: This !SerDe is used to read/write delimited records such as CSV, tab-separated, or control-A-separated records (sorry, quoting is not supported yet).
-  * ThriftSerDe: This serde is used to read/write thrift serialized objects. The class file for the Thrift object must be loaded first.
+  * !ThriftSerDe: This !SerDe is used to read/write Thrift-serialized objects. The class file for the Thrift object must be loaded first.
-  * DynamicSerDe: This serde also read/write thrift serialized objects, but it understands thrift DDL so the schema of the object can be provided at runtime. Also it supports a lot of different protocols, including TBinaryProtocol, TJSONProtocol, TCTLSeparatedProtocol (which writes data in delimited records).
+  * !DynamicSerDe: This !SerDe also reads/writes Thrift-serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports many different protocols, including !TBinaryProtocol, !TJSONProtocol, and !TCTLSeparatedProtocol (which writes data in delimited records).

- How to write your own SerDe:
+ How to write your own !SerDe:
-  * In most cases, users want to write a Deserializer instead of a SerDe, because users just want to read their own data format instead of writing to it.
+  * In most cases, users want to write a Deserializer rather than a full !SerDe, because they only need to read their own data format, not write it (see the sketch after this list).
-  * For example, the RegexDeserializer will deserialize the data using the configuration parameter 'regex', and possibly a list of column names (see serde2.MetadataTypedColumnsetSerDe). Please see serde2/Deserializer.java for details.
+  * For example, the !RegexDeserializer will deserialize the data using the configuration parameter 'regex', and possibly a list of column names (see serde2.MetadataTypedColumnsetSerDe). See serde2/Deserializer.java for details.
-  * If your SerDe supports DDL (basically, SerDe with parameterized columns and column types), you probably want to implement a Protocol based on DynamicSerDe, instead of writing a SerDe from scratch. The reason is that the framework passes DDL to SerDe through "thrift DDL" format, and it's non-trivial to write a "thrift DDL" parser.
+  * If your !SerDe supports DDL (basically, a !SerDe with parameterized columns and column types), you probably want to implement a Protocol based on !DynamicSerDe instead of writing a !SerDe from scratch, because the framework passes the DDL to the !SerDe in "Thrift DDL" format, and writing a "Thrift DDL" parser is non-trivial.
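As a rough illustration of the Deserializer contract described in the list above, here is a minimal sketch of a custom Deserializer that reads pipe-delimited text and exposes every column as a string. It is not from the wiki page: the package, class name, and the "columns" property key are assumptions, and it is written against the early serde2 interface (initialize / deserialize / getObjectInspector), so factory and interface signatures may differ between Hive versions.

{{{
// Hypothetical example, not part of Hive. Written against the early serde2
// Deserializer contract (initialize/deserialize/getObjectInspector); later
// Hive versions add more methods and factory names may differ.
package org.example.hive;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/** Reads pipe-delimited text records and exposes every column as a string. */
public class PipeDelimitedDeserializer implements Deserializer {

  private int numColumns;
  private ObjectInspector rowInspector;
  private final List<String> row = new ArrayList<String>();   // reused per record

  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // The table's column names arrive through the table properties
    // ("columns" is assumed here to be the property key).
    List<String> columnNames =
        Arrays.asList(tbl.getProperty("columns", "col").split(","));
    numColumns = columnNames.size();

    // Every column is presented as a Java string in this sketch.
    List<ObjectInspector> columnInspectors = new ArrayList<ObjectInspector>();
    for (int i = 0; i < numColumns; i++) {
      columnInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    }
    rowInspector = ObjectInspectorFactory.getStandardStructObjectInspector(
        columnNames, columnInspectors);
  }

  public Object deserialize(Writable blob) throws SerDeException {
    // Only the "value" half of the <key, value> pair reaches the Deserializer.
    String[] fields = ((Text) blob).toString().split("\\|", -1);
    row.clear();
    for (int i = 0; i < numColumns; i++) {
      row.add(i < fields.length ? fields[i] : null);   // missing columns become NULL
    }
    return row;   // a standard java.util.List represents the Struct row
  }

  public ObjectInspector getObjectInspector() throws SerDeException {
    return rowInspector;
  }
}
}}}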
- Some important points of SerDe:
+ Some important points about !SerDe:
-  * SerDe, not the DDL, defines the table schema. Some SerDe implementations use the DDL for configuration, but SerDe can also override that.
+  * The !SerDe, not the DDL, defines the table schema. Some !SerDe implementations use the DDL for configuration, but the !SerDe can also override that.
   * Column types can be arbitrarily nested arrays, maps, and structures.
-  * The callback design of ObjectInspector allows lazy deserialization with CASE/IF or when using complex or nested types.
+  * The callback design of !ObjectInspector allows lazy deserialization with CASE/IF or when using complex or nested types.
+
+ ==== ObjectInspector ====
+ Hive uses !ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
+
+ !ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in memory, including:
+  * An instance of a Java class (Thrift or native Java)
+  * A standard Java object (we use java.util.List to represent Struct and Array, and java.util.Map to represent Map)
+  * A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with a starting offset for each field)
+
+ A complex object can be represented by a pair of !ObjectInspector and Java Object.
+ The !ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.
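To make the uniform-access idea concrete, here is a small sketch (again, not from the wiki page) that builds a standard struct !ObjectInspector for two string columns and reads a row's fields through it. The row object here happens to be a java.util.List, but the reading code would stay the same for a Thrift or lazily-initialized row as long as a matching !ObjectInspector is supplied. Class and field names are made up, and the inspector factory names may vary across Hive versions.

{{{
// Hypothetical demo of generic field access through ObjectInspector.
import java.util.Arrays;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ObjectInspectorDemo {
  public static void main(String[] args) {
    // Build an inspector describing a struct<name:string, city:string>.
    StructObjectInspector rowOI =
        ObjectInspectorFactory.getStandardStructObjectInspector(
            Arrays.asList("name", "city"),
            Arrays.<ObjectInspector>asList(
                PrimitiveObjectInspectorFactory.javaStringObjectInspector,
                PrimitiveObjectInspectorFactory.javaStringObjectInspector));

    // The "standard" representation of a Struct is a java.util.List, as
    // described above; a Thrift or lazy row would just pair the same kind
    // of access code with a different ObjectInspector.
    Object row = Arrays.asList("alice", "stockholm");

    // Generic field access: the caller never touches the row's concrete class.
    for (StructField field : rowOI.getAllStructFieldRefs()) {
      Object value = rowOI.getStructFieldData(row, field);
      System.out.println(field.getFieldName() + " = " + value);
    }
  }
}
}}}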
  === MetaStore ===