If you just need to scan the data once, it makes sense to use a Hive SerDe to read the data directly, which saves you one I/O round trip.
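To make the trade-off concrete, here is a rough HiveQL sketch of the two options. The SerDe class, table, and column names are all hypothetical (as the thread notes, Hive ships no built-in protobuf SerDe, so the class would have to be your own):

```sql
-- One-pass case: read the protobuf files in place through a custom SerDe.
-- 'com.example.ProtobufSerDe' is a hypothetical class name, not a real Hive class.
CREATE EXTERNAL TABLE events_pb (user_id BIGINT, url STRING, ts BIGINT)
ROW FORMAT SERDE 'com.example.ProtobufSerDe'
LOCATION '/data/events_pb';

SELECT url, COUNT(*) FROM events_pb GROUP BY url;

-- Multiple-read case: pay one extra pass up front to materialize only the
-- few interesting columns, then run later queries against the narrow table.
CREATE TABLE events_narrow AS
SELECT user_id, url, ts FROM events_pb;
```

The narrow table costs one extra write, but every subsequent query scans a few columns instead of deserializing all ~15 protobuf fields.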
If you need to read the data multiple times, then it's better to save the 3 columns into separate files.

Zheng

On Mon, Jul 12, 2010 at 5:08 PM, Leo Alekseyev <dnqu...@gmail.com> wrote:
> Hi all,
> I was wondering if anyone is using Hive with protocol buffers. The
> Hadoop wiki links to
> http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook
> for SerDe examples; there it says that there is no built-in support
> for protobufs. Since this presentation is about a year old, I was
> wondering whether any UDFs, native or third-party, have appeared to
> deal with them.
>
> I am also curious about the relative efficiency of performing SerDe
> using UDFs in Hive vs. running a separate Hadoop job to first
> deserialize the data from protocol buffers into an ASCII flat file
> with only the "interesting" fields (going from ~15 fields to ~3), and
> then doing the rest of the computation in Hive. I'd appreciate any
> comments!
>
> Thanks,
> --Leo

--
Yours,
Zheng
http://www.linkedin.com/in/zshao