Hi, I am trying to integrate Parquet as the underlying storage format in our data pipeline, but I am running into some issues that I hope some of you can help me with.
The batch layer is fairly standard: some Cascading flows write Thrift log objects from an input tap to a Parquet output sink. Here is a snippet of one of the Thrift structures being serialized:

    struct RequestInfo {
      1: optional string status,
      2: optional list<RequestDetails> requests,
    }

    struct RequestDetails {
      1: optional string type,
      2: optional bool valid,
    }

Looking at the Cascading Parquet writer, this translates into the following Parquet schema:

    optional binary status (UTF8);
    optional group requests (LIST) {
      repeated group requests_tuple {
        optional binary type (UTF8);
        optional boolean valid;
      }
    }

Then I have a Hive table that points to the Parquet file while specifying the serialized Thrift class:

    CREATE EXTERNAL TABLE parquet_requests
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 'hdfs_somewhere'
    TBLPROPERTIES ('thrift.class' = 'RequestInfo');

When running "select * from parquet_requests", the whole thing crashes with an IllegalStateException thrown from this check in ArrayWritableGroupConverter:

> public ArrayWritableGroupConverter(final GroupType groupType, final HiveGroupConverter parent,
>     final int index) {
>   this.parent = parent;
>   this.index = index;
>   int count = groupType.getFieldCount();
>   if (count < 1 || count > 2) {
>     throw new IllegalStateException("Field count must be either 1 or 2: " + count);
>   }

What this means is that requests_tuple is not considered a valid list because it has more than one field. The converter basically expects the "repeated" keyword on "requests (LIST)" rather than on "requests_tuple". The actual code also does not seem to handle "repeated" on primitives, since the ETypeConverters always call parent.set() and therefore keep replacing the previously stored value.
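To illustrate that last point in isolation: the sketch below uses made-up stand-in classes (SetOnlyParent and ListParent are not Hive or Parquet types) to show why a converter that invokes a set()-style callback for every repeated value keeps only the last one, whereas list semantics require an add()-style callback that accumulates:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parent that only has a single slot, like a converter
// that calls parent.set() for each value of a repeated field.
class SetOnlyParent {
    Object value;
    void set(Object v) { value = v; } // overwrites the previous element
}

// Hypothetical parent with list semantics: each repeated value is appended.
class ListParent {
    List<Object> values = new ArrayList<>();
    void add(Object v) { values.add(v); } // accumulates repeated values
}

public class RepeatedPrimitiveSketch {
    public static void main(String[] args) {
        SetOnlyParent set = new SetOnlyParent();
        ListParent list = new ListParent();
        for (int v : new int[]{1, 2, 3}) { // simulate a repeated int field
            set.set(v);
            list.add(v);
        }
        System.out.println(set.value);   // 3 -- only the last value survives
        System.out.println(list.values); // [1, 2, 3]
    }
}
```

In other words, any fix has to accumulate repeated values somewhere instead of overwriting a single slot, which is what my patch tries to do.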
I cooked up a patch which, as far as I can tell, fixes the issues described here, and I would like some comments on whether it is heading in the right direction before I submit a more formal pull request. Things still need to be polished, so please don't spend too much time on the form, but rather on the approach. https://github.com/jpbillaud/hive/commit/4c1de69b0c484903d663b920c1bfbdf8cd9b920d Moreover, I have a feeling that I should probably not have to pass the Thrift class for the Parquet table, given that at this point it is totally irrelevant: the Parquet schema is stored in the Parquet files themselves. I also expect some ObjectInspector issues due to the extra level of grouping introduced by the requests_tuple entry. Thoughts? Thanks,