Sorry about the delay on this.
Here are several example SerDes that got added to the code base recently:

RegexSerDe: A SerDe for parsing text using regex (and an example for parsing Apache Log using a regex)
https://issues.apache.org/jira/browse/HIVE-167
contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java

BinarySortableSerDe: A SerDe that serializes rows into a binary format that keeps the relative order of rows.
https://issues.apache.org/jira/browse/HIVE-553
serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java

We also have a ThriftDeserializer.java, which you can take as an example for writing ProtocolBufferSerDe.
I will also try to clean up ThriftDeserializer a bit.

Zheng

On Sun, Jul 12, 2009 at 11:41 PM, Zheng Shao<[email protected]> wrote:
> Hi Kevin,
>
> Yes I will work on a how-to tutorial on SerDe this week.
>
> One important performance benefit of Hive SerDe is that it can reuse
> the same object to deserialize different rows - which means there can
> be no object creation needed for each of the rows.
>
> Zheng
>
> On Sun, Jul 12, 2009 at 10:15 PM, Kevin Weil<[email protected]> wrote:
>> +1 to Roberto's question... I'd love some more examples here too. I looked
>> into writing a protocol buffer Serde a little while ago (the company I was
>> working for had data coming in as protobufs, and it seemed silly to convert
>> every piece to thrift first) and was underwhelmed by the
>> documentation/explanations. FWIW, and maybe to generate a little friendly
>> competition, I was able to write a pig LoadFunc to load arbitrary protocol
>> buffers to pig tuples without much trouble...
>> Kevin
>>
>> On Wed, Jul 8, 2009 at 4:26 PM, Roberto Congiu <[email protected]>
>> wrote:
>>>
>>> Hi,
>>> I am writing a SerDe class to be able to query some proprietary format we
>>> have from hive.
>>> The format is basically a sequence of records that are maps coded in
>>> binary for which we have access libraries.
>>> The file is also gzipped.
>>> For what I understand, I need to
>>> 1 - write a FileInputFormat class to read the file and extract the single
>>> records as Writables (but I am not clear how I tell hive to use this
>>> fileformat since all I can use is STORED AS SEQUENCEFILE/TEXTFILE. How do I
>>> plug my format in there? )
>>> 2 - Write a SerDe (Since I just need to read it I need just the
>>> deserializer part) and an ObjectInspector to let hive understand how to find
>>> a column
>>> is there any info around for these or somebody who's done something
>>> similar ?
>>> Thanks in advance,
>>> Roberto
>>
>
>
>
> --
> Yours,
> Zheng
>

--
Yours,
Zheng
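
To illustrate the object-reuse point from the quoted reply (one object being reused to deserialize many rows), here is a minimal Java sketch. It does not use Hive's SerDe interfaces at all: the class name ReusingRowDeserializer, the tab-separated input, and the List<String> row representation are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a deserializer; NOT Hive's actual SerDe API.
final class ReusingRowDeserializer {
    // The single row container, allocated once and overwritten on every call.
    private final List<String> row;
    private final int numColumns;

    ReusingRowDeserializer(int numColumns) {
        this.numColumns = numColumns;
        this.row = new ArrayList<>(numColumns);
        for (int i = 0; i < numColumns; i++) {
            row.add(null);
        }
    }

    // Parses one tab-separated line into the shared row object and returns it.
    // Missing trailing columns become null; extra columns are ignored.
    // Note: this sketch still allocates a String per field; only the row
    // container itself is reused between calls.
    List<String> deserialize(String line) {
        int start = 0;
        for (int col = 0; col < numColumns; col++) {
            if (start > line.length()) {
                row.set(col, null);
                continue;
            }
            int end = line.indexOf('\t', start);
            if (end < 0) {
                end = line.length();
            }
            row.set(col, line.substring(start, end));
            start = end + 1;
        }
        return row;
    }
}
```

Because every call returns the same list instance, a caller that needs to keep a row past the next deserialize() call would have to copy it first.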
