I just joined the list today, and it is so quiet here that I started to doubt whether I had actually joined or not.
Anyway, I'll give it a try with a question that is currently blocking me.
Most datasets on our production Hadoop cluster are currently stored in Avro + Snappy format. I have heard lots of good things about Parquet and want to give it a try.
I followed this web page, 
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/, 
to change one of our ETL jobs to generate Parquet files, instead of Avro, as the 
output of our reducer. I used the Parquet + Avro support to write the final output 
data with our existing Avro schema, plus the Snappy codec. Everything works fine, 
so the final output Parquet files should have the same schema as our original Avro files.
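
For reference, this is roughly how the reducer job is configured (a minimal sketch following that blog post; the schema file path and variable names are placeholders, not our exact code):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import parquet.avro.AvroParquetOutputFormat;
    import parquet.hadoop.ParquetOutputFormat;
    import parquet.hadoop.metadata.CompressionCodecName;

    // Driver-side job setup: same Avro schema as before, but Parquet as the output format
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "avro-to-parquet");
    Schema avroSchema = new Schema.Parser().parse(new File("our_record.avsc"));  // placeholder path
    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setSchema(job, avroSchema);
    // Keep the Snappy codec we were using with the Avro files
    ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
    // The reducer then emits (Void, GenericRecord) pairs as the final output
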
Now I am trying to create a Hive table for these Parquet files. We use IBM 
BigInsights 3.0, which ships Hive 0.12 and Parquet 1.3.2.
Based on our Avro schema file, I came up with the following Hive DDL:
create table xxx (
  col1 bigint,
  col2 string,
  .................
  field1 array<struct<sub1:string, sub2:string, date_value:bigint>>,
  field2 array<struct<..............>>
)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION 'xxxx'
The table was created successfully in Hive 0.12, and I can run "desc table" on it 
without any problem.
But when I tried to query the table, like "select * from table limit 2", I got 
the following error:

Caused by: java.lang.RuntimeException: Invalid parquet hive schema: repeated group array {
  required binary sub1 (UTF8);
  optional binary sub2 (UTF8);
  optional int64 date_value;
}
        at parquet.hive.convert.ArrayWritableGroupConverter.<init>(ArrayWritableGroupConverter.java:56)
        at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:36)
        at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
        at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:46)
        at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:38)
        at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
        at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:40)
        at parquet.hive.convert.DataWritableRecordConverter.<init>(DataWritableRecordConverter.java:32)
        at parquet.hive.read.DataWritableReadSupport.prepareForRead(DataWritableReadSupport.java:109)
        at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
        at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
        at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
        at parquet.hive.MapredParquetInputFormat$RecordReaderWrapper.<init>(MapredParquetInputFormat.java:230)
        at parquet.hive.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:119)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:439)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:522)
        ... 14 more
I noticed that the error comes from the first nested array-of-struct column. My 
questions are the following:

1) Does Parquet support nested arrays of structs?
2) Is this only related to Parquet 1.3.2? Is there any workaround on Parquet 1.3.2?
3) If I have to use a later version of Parquet to fix the above problem, and 
Parquet 1.3.2 is still present at runtime, will that cause any issue?
4) Can I use all Hive features, like "explode" of nested structures, on the 
Parquet data?

What we are looking for is to know whether Parquet can be used the same way we 
currently use Avro, while giving us the columnar storage benefits that Avro is missing.
Thanks
Yong
