I just joined the list today, and it is so quiet here that I even doubt whether I actually joined or not.
Anyway, I will give it a try with a question that is currently blocking me.
Most datasets on our production Hadoop cluster are currently stored in Avro + Snappy format. I have heard lots of good things about Parquet and want to give it a try.
I followed this web page,
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/,
to change one of our ETL jobs to generate Parquet files, instead of Avro, as the
output of our reducer. I used Parquet with our Avro schema to produce the final
output data, plus the Snappy codec. Everything works fine, so the final output
Parquet files should have the same schema as our original Avro files.
Now I am trying to create a Hive table for these Parquet files. IBM BigInsights 3.0,
which we use, currently ships with Hive 0.12 and Parquet 1.3.2.
Based on our Avro schema file, I came up with the following Hive DDL:
create table xxx (
  col1 bigint,
  col2 string,
  .................
  field1 array<struct<sub1:string, sub2:string, date_value:bigint>>,
  field2 array<struct<..............>>
)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION 'xxxx'
The table is created successfully in Hive 0.12, and I can "desc table" without any
problem.
But when I tried to query the table, like "select * from table limit 2", I got
the following error:

Caused by: java.lang.RuntimeException: Invalid parquet hive schema: repeated group array { required binary sub1 (UTF8); optional binary sub2 (UTF8); optional int64 date_value;}
    at parquet.hive.convert.ArrayWritableGroupConverter.<init>(ArrayWritableGroupConverter.java:56)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:36)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:46)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:38)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:40)
    at parquet.hive.convert.DataWritableRecordConverter.<init>(DataWritableRecordConverter.java:32)
    at parquet.hive.read.DataWritableReadSupport.prepareForRead(DataWritableReadSupport.java:109)
    at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
    at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
    at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
    at parquet.hive.MapredParquetInputFormat$RecordReaderWrapper.<init>(MapredParquetInputFormat.java:230)
    at parquet.hive.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:119)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:439)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:522)
    ... 14 more
I noticed that the error comes from the first nested array-of-struct column.
My questions are the following:
1) Does Parquet support a nested array of struct?
2) Is this only related to Parquet 1.3.2? Is there any workaround for it on Parquet 1.3.2?
3) If I have to use a later version of Parquet to fix the above problem, and Parquet 1.3.2 is still available at runtime, will that cause any issue?
4) Can I use all kinds of Hive features, like "explode" of a nested structure, on the Parquet data? (A small example of what I mean is sketched after this list.)
What we are looking for is to know whether Parquet can be used the same way as we currently use Avro, while giving us the columnar storage benefits that are missing from Avro.
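For question 4, this is roughly the kind of query I would like to be able to run. It is just a sketch against the placeholder table and column names from my DDL above (xxx, col1, field1, sub1, sub2, date_value), not something I have been able to test on the Parquet table yet:

  -- Flatten field1 (an array of struct) into one row per element,
  -- then read the struct members from the exploded column.
  SELECT t.col1,
         f.sub1,
         f.sub2,
         f.date_value
  FROM xxx t
  LATERAL VIEW explode(t.field1) exploded AS f
  LIMIT 10;

This kind of LATERAL VIEW explode query works fine for us today on the Avro-backed tables, so I want to confirm the same usage is possible once the data is in Parquet.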
Thanks
Yong