Re: how to write a SerDe

Zheng Shao Fri, 24 Jul 2009 01:39:42 -0700

Sorry about the delay on this.


Here are several example SerDes that got added to the code base recently:

RegexSerDe: A SerDe for parsing text using a regex (with an example
that parses Apache logs using a regex)
  https://issues.apache.org/jira/browse/HIVE-167
  contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
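
To give a feel for the regex idea, here is a minimal plain-Java sketch. It
is not the RegexSerDe code and uses no Hive classes: the class name, the
method name and the pattern (a guess at Apache common log format) are all
assumptions made for illustration. Each capture group becomes one string
column, and a line that does not match yields null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Hive.
class RegexRowSketch {
    // Assumed layout: host identity user [time] "request" status size
    static final Pattern LOG = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    // Returns one string per capture group, or null if the line does not match.
    static List<String> parse(String line) {
        Matcher m = LOG.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> columns = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            columns.add(m.group(i));
        }
        return columns;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        System.out.println(parse(line));
    }
}
```

Keeping every column as a string keeps the sketch small; converting
columns to other types would be a separate step.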

BinarySortableSerDe: A SerDe that serializes rows into a binary format
that keeps the relative order of rows.
  https://issues.apache.org/jira/browse/HIVE-553
  serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java
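
As a rough sketch of what "keeps the relative order" can mean for a single
int column: flip the sign bit and write the value big-endian, so that
comparing the bytes as unsigned values gives the same result as comparing
the ints. This is a common order-preserving encoding; whether
BinarySortableSerDe does exactly this is an assumption on my part, and the
class and method names below are made up for the example.

```java
// Hypothetical illustration only; not part of Hive.
class SortableIntSketch {
    // Encode an int so that unsigned byte-wise order matches numeric order.
    static byte[] encode(int v) {
        int u = v ^ 0x80000000; // flip the sign bit: negatives sort before positives
        return new byte[] {
            (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u
        };
    }

    // Compare two byte arrays lexicographically, treating bytes as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        // Negative result: the encoding of -5 sorts before the encoding of 3.
        System.out.println(compareBytes(encode(-5), encode(3)));
    }
}
```

The point of such an encoding is that rows can be sorted by comparing raw
bytes, without deserializing them first.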

We also have a ThriftDeserializer.java, which you can take as an
example for writing a ProtocolBufferSerDe.
I will also try to clean up ThriftDeserializer a bit.


Zheng

On Sun, Jul 12, 2009 at 11:41 PM, Zheng Shao<[email protected]> wrote:
> Hi Kevin,
>
> Yes I will work on a how-to tutorial on SerDe this week.
>
> One important performance benefit of Hive SerDe is that it can reuse
> the same object to deserialize different rows - which means no new
> object needs to be created for each row.
>
> Zheng
>
> On Sun, Jul 12, 2009 at 10:15 PM, Kevin Weil<[email protected]> wrote:
>> +1 to Roberto's question... I'd love some more examples here too.  I looked
>> into writing a protocol buffer Serde a little while ago (the company I was
>> working for had data coming in as protobufs, and it seemed silly to convert
>> every piece to thrift first) and was underwhelmed by the
>> documentation/explanations.  FWIW, and maybe to generate a little friendly
>> competition, I was able to write a pig LoadFunc to load arbitrary protocol
>> buffers to pig tuples without much trouble...
>> Kevin
>>
>> On Wed, Jul 8, 2009 at 4:26 PM, Roberto Congiu <[email protected]>
>> wrote:
>>>
>>> Hi,
>>> I am writing a SerDe class to be able to query some proprietary format we
>>> have from hive.
>>> The format is basically a sequence of records that are maps coded in
>>> binary for which we have access libraries.
>>> The file is also gzipped.
>>> From what I understand, I need to
>>> 1 - write a FileInputFormat class to read the file and extract the single
>>> records as Writables (but I am not clear how I tell hive to use this
>>> fileformat since all I can use is STORED AS SEQUENCEFILE/TEXTFILE. How do I
>>> plug my format in there? )
>>> 2 - Write a SerDe (Since I just need to read it I need just the
>>> deserializer part) and an ObjectInspector to let hive understand how to find
>>> a column
>>> Is there any info around for these, or somebody who's done something
>>> similar?
>>> Thanks in advance,
>>> Roberto
>>
>
>
>
> --
> Yours,
> Zheng
>



-- 
Yours,
Zheng
