Hi, I am trying to integrate Parquet as the underlying storage format in our data pipeline, but I am running into some issues that I hope some of you can help me with.
The batch layer is fairly standard: some Cascading flows write Thrift log objects from an input tap to a Parquet output sink. Here is a snippet of one of the Thrift structures being serialized:

    struct RequestInfo {
      1: optional string status,
      2: optional list<RequestDetails> requests,
    }

    struct RequestDetails {
      1: optional string type,
      2: optional bool valid,
    }

Looking at the Cascading Parquet writer, this translates into the following Parquet schema:

    optional binary status (UTF8);
    optional group requests (LIST) {
      repeated group requests_tuple {
        optional binary type (UTF8);
        optional boolean valid;
      }
    }

Then I have a Hive table that points to the Parquet file while specifying the serialized Thrift class:

    CREATE EXTERNAL TABLE parquet_requests
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 'hdfs_somewhere'
    TBLPROPERTIES ('thrift.class' = 'RequestInfo');

When running "select * from parquet_requests", the whole thing crashes with an IllegalStateException thrown from this check in ArrayWritableGroupConverter:

> public ArrayWritableGroupConverter(final GroupType groupType, final HiveGroupConverter parent,
>     final int index) {
>   this.parent = parent;
>   this.index = index;
>   int count = groupType.getFieldCount();
>   if (count < 1 || count > 2) {
>     throw new IllegalStateException("Field count must be either 1 or 2: " + count);
>   }

What this means is that requests_tuple is not considered a valid list because it has more than one field. The converter basically expects the "repeated" keyword on "requests (LIST)" rather than on "requests_tuple". The actual code also does not seem to handle "repeated" on primitives, since the ETypeConverters always call parent.set() and therefore keep replacing the previously stored value.
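To illustrate that last point in isolation: the sketch below uses made-up stand-in classes (SetOnlyParent and ListParent are not Hive or Parquet types) to show why a converter that invokes a set()-style callback for every repeated value keeps only the last one, whereas list semantics require an add()-style callback that accumulates:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parent that only has a single slot, like a converter
// that calls parent.set() for each value of a repeated field.
class SetOnlyParent {
    Object value;
    void set(Object v) { value = v; } // overwrites the previous element
}

// Hypothetical parent with list semantics: each repeated value is appended.
class ListParent {
    List<Object> values = new ArrayList<>();
    void add(Object v) { values.add(v); } // accumulates repeated values
}

public class RepeatedPrimitiveSketch {
    public static void main(String[] args) {
        SetOnlyParent set = new SetOnlyParent();
        ListParent list = new ListParent();
        for (int v : new int[]{1, 2, 3}) { // simulate a repeated int field
            set.set(v);
            list.add(v);
        }
        System.out.println(set.value);   // 3 -- only the last value survives
        System.out.println(list.values); // [1, 2, 3]
    }
}
```

In other words, any fix has to accumulate repeated values somewhere instead of overwriting a single slot, which is what my patch tries to do.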
I cooked up a patch which, as far as I can tell, fixes the issues described here, and I would like some comments on whether it is heading in the right direction before I submit a more formal pull request. Things still need to be polished, so please don't spend too much time on the form, but rather on the approach. https://github.com/jpbillaud/hive/commit/4c1de69b0c484903d663b920c1bfbdf8cd9b920d Moreover, I have a feeling that I should probably not have to pass the Thrift class for the Parquet table, given that at this point it is totally irrelevant: the Parquet schema is stored in the Parquet files themselves. I also expect some ObjectInspector issues due to the extra level of grouping introduced by the requests_tuple entry. Thoughts? Thanks,