Hi,
I am new to Parquet, and we have a complicated use case for which we want
to adopt Parquet as our storage format.
Current:
- The data is stored in Sequence files as Protobuf records.
- We have map reduce jobs that write the data. Hive tables were created
with the Protobuf SerDe from elephant-bird so people can query the data
via Hive.
- We enhanced elephant-bird with our own serializer so one can write data
into a table via Hive, with the data stored in Sequence files as Protobuf.
Future:
We want to use Parquet as the underlying storage format without losing the
Protobuf abstraction at the application layer. After a bit of research and
experimentation, I have a few questions.
- Say a Hive table is created as a Parquet table, and data is written via
Hive.
- If I want to read the data in map reduce jobs as Protobuf records, can I
use ProtoParquetInputFormat in
https://github.com/Parquet/parquet-mr/blob/master/parquet-protobuf/src/main/java/parquet/proto/ProtoParquetInputFormat.java?
Looking at the API, it doesn't seem possible to specify the Protobuf class
for an input path. Instead, ProtoParquetInputFormat derives the class from
the footer of the underlying data. Is it fair to say that
ProtoParquetInputFormat will only read data written by
ProtoParquetOutputFormat? Is there a way to work around this? (See the
read-side sketch after this list.)
- If not, is there an out-of-the-box Hive output format I can use to
piggyback on ProtoParquetOutputFormat?
- If data is written by a map reduce job with ProtoParquetOutputFormat,
will read queries in Hive work automatically? (A write-side sketch of what
I have in mind follows below.)
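
To make the read-side question concrete, here is a minimal sketch of how I
understand a read job would be wired up. The driver, mapper, and paths are
my assumptions, not tested code. Note that nothing in it names the Protobuf
class: as far as I can tell, ProtoParquetInputFormat picks it up from each
file's footer, which I believe is only written by the proto write support.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

import com.google.protobuf.MessageOrBuilder;

import parquet.proto.ProtoParquetInputFormat;

public class ReadProtoParquet {

  // The value type is whatever the read support materializes; the concrete
  // Protobuf class comes from the file footer, not from job configuration.
  public static class ProtoMapper
      extends Mapper<Void, MessageOrBuilder, NullWritable, NullWritable> {
    @Override
    protected void map(Void key, MessageOrBuilder record, Context ctx) {
      // Inspect or convert the Protobuf record here.
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read-proto-parquet");
    job.setJarByClass(ReadProtoParquet.class);
    // No place here to pass the expected Protobuf class for the input path.
    job.setInputFormatClass(ProtoParquetInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapperClass(ProtoMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}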
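
And for the write side, a sketch of how I understand ProtoParquetOutputFormat
would be used, assuming the setProtobufClass helper I see in the same package.
MyEvent and its raw field are hypothetical generated Protobuf types standing
in for our real messages. If Hive can read the files this job produces, that
would answer my last question.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import parquet.proto.ProtoParquetOutputFormat;

public class WriteProtoParquet {

  // Hypothetical mapper turning text lines into MyEvent messages
  // (MyEvent is a made-up generated Protobuf class).
  public static class ToProtoMapper
      extends Mapper<LongWritable, Text, Void, MyEvent> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(null, MyEvent.newBuilder().setRaw(line.toString()).build());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "write-proto-parquet");
    job.setJarByClass(WriteProtoParquet.class);
    job.setMapperClass(ToProtoMapper.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setOutputFormatClass(ProtoParquetOutputFormat.class);
    // Records the proto class in the file footer, which is what the
    // read side appears to rely on.
    ProtoParquetOutputFormat.setProtobufClass(job, MyEvent.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}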
Thanks a lot in advance. Any suggestions would be appreciated.
--
Chen Song