Hi,
I am new to Parquet, and we have a complicated use case for which we want
to adopt Parquet as our storage format.
Current:
- The data is stored in Sequence files as Protobuf records.
- We have map reduce jobs that write the data. Hive tables were created
with the Protobuf SerDe from elephant-bird so people can query the data
via Hive.
- We enhanced elephant-bird with our own serializer so one can write data
into a table via Hive, with the data stored in Sequence files as Protobuf.
Future:
We want to use Parquet as the underlying storage format without losing the
Protobuf abstraction at the application layer. After a bit of research and
experimentation, I have a few questions.
- Say a Hive table is created as a Parquet table, and data is written via
Hive.
- If I want to read the data in map reduce jobs as Protobuf records, can I
use ProtoParquetInputFormat in
https://github.com/Parquet/parquet-mr/blob/master/parquet-protobuf/src/main/java/parquet/proto/ProtoParquetInputFormat.java?
Looking at the API, it doesn't seem possible to specify the Protobuf class
for an input path. Instead, ProtoParquetInputFormat derives the class from
the footer of the underlying data. Is it fair to say that
ProtoParquetInputFormat will only read data written by
ProtoParquetOutputFormat? Is there a way to work around this? (See the
read-side sketch after this list.)
- If not, is there an out-of-the-box Hive output format I can use to
piggyback on ProtoParquetOutputFormat?
- If data is written by a map reduce job with ProtoParquetOutputFormat,
will read queries in Hive work automatically? (A write-side sketch of what
I have in mind follows below.)
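
To make the read-side question concrete, here is a minimal sketch of how I
understand a read job would be wired up. The driver, mapper, and paths are
my assumptions, not tested code. Note that nothing in it names the Protobuf
class: as far as I can tell, ProtoParquetInputFormat picks it up from each
file's footer, which I believe is only written by the proto write support.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

import com.google.protobuf.MessageOrBuilder;

import parquet.proto.ProtoParquetInputFormat;

public class ReadProtoParquet {

  // The value type is whatever the read support materializes; the concrete
  // Protobuf class comes from the file footer, not from job configuration.
  public static class ProtoMapper
      extends Mapper<Void, MessageOrBuilder, NullWritable, NullWritable> {
    @Override
    protected void map(Void key, MessageOrBuilder record, Context ctx) {
      // Inspect or convert the Protobuf record here.
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read-proto-parquet");
    job.setJarByClass(ReadProtoParquet.class);
    // No place here to pass the expected Protobuf class for the input path.
    job.setInputFormatClass(ProtoParquetInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapperClass(ProtoMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}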
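
And for the write side, a sketch of how I understand ProtoParquetOutputFormat
would be used, assuming the setProtobufClass helper I see in the same package.
MyEvent and its raw field are hypothetical generated Protobuf types standing
in for our real messages. If Hive can read the files this job produces, that
would answer my last question.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import parquet.proto.ProtoParquetOutputFormat;

public class WriteProtoParquet {

  // Hypothetical mapper turning text lines into MyEvent messages
  // (MyEvent is a made-up generated Protobuf class).
  public static class ToProtoMapper
      extends Mapper<LongWritable, Text, Void, MyEvent> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(null, MyEvent.newBuilder().setRaw(line.toString()).build());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "write-proto-parquet");
    job.setJarByClass(WriteProtoParquet.class);
    job.setMapperClass(ToProtoMapper.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setOutputFormatClass(ProtoParquetOutputFormat.class);
    // Records the proto class in the file footer, which is what the
    // read side appears to rely on.
    ProtoParquetOutputFormat.setProtobufClass(job, MyEvent.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}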
Thanks a lot in advance. Any suggestions would be appreciated.
--
Chen Song