Hey Ryan, I take it, then, that parquet-thrift structures using lists/sets/maps are not supported by Hive as of today.
Regarding the patch that I posted: since I need to make it work for my deployment regardless, does the approach make sense so far? I still need to hack into the ObjInspector so that once Hive encounters a LIST field (from Hive's standpoint), the ObjInspector removes one unnecessary layer of ArrayWritable coming from the extra "_tuple" field. Does that make sense?

Thanks,

On Tue, Nov 11, 2014 at 3:14 PM, Ryan Blue <b...@cloudera.com> wrote:

> On 11/11/2014 01:07 PM, Jean-Pascal Billaud wrote:
>
>> While running "select * from parquet_requests", the whole thing crashes
>> with the following exception:
>>
>> > public ArrayWritableGroupConverter(final GroupType groupType, final HiveGroupConverter parent,
>> >     final int index) {
>> >   this.parent = parent;
>> >   this.index = index;
>> >   int count = groupType.getFieldCount();
>> >   if (count < 1 || count > 2) {
>> >     throw new IllegalStateException("Field count must be either 1 or 2: " + count);
>> >   }
>> >
>>
>> What this means is that requests_tuple is not considered a valid list
>> because it has more than one field. It basically expects the "repeated"
>> keyword on the "requests (LIST)" as opposed to "requests_tuple". The
>> actual code also does not seem to handle repeated on primitives since the
>> ETypeConverters always call parent.set() hence always replacing the
>> previous stored instance.
>>
>> I cooked up a patch which as far as I can tell would fix the issues here
>> and I would like to have some comments to see if that patch is in the
>> right direction before submitting a more formal pull request. Things need
>> to be polished so please don't spend too much time on the form but more on
>> the approach.
>>
>> https://github.com/jpbillaud/hive/commit/4c1de69b0c484903d663b920c1bfbdf8cd9b920d
>>
>> Moreover, I have a feeling that I should probably not pass the thrift
>> class for the parquet table given that at this point it is totally
>> irrelevant and the parquet schema is stored in the parquet files. I also
>> expect some ObjectInspector issue due to the extra grouping provided by
>> the requests_tuple entry. Thoughts?
>>
>> Thanks,
>>
>
> Hi Jean-Pascal,
>
> This is a known issue that we're going to be fixing shortly. The problem
> is that there's a difference in the way Hive and Thrift (or Avro)
> represents lists. PARQUET-113 [1] is an effort to define what is currently
> being written and what we need to do to add the compatibility. It also
> specifies what should be written.
>
> Hive is one of the first object models that will be updated with the
> backward-compatibility rules so that it can read parquet-avro and
> parquet-thrift structures correctly.
>
> rb
>
> [1]: https://issues.apache.org/jira/browse/PARQUET-113
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
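
For reference, the ObjectInspector change described at the top of this message could be sketched roughly as below. This is only an illustration, not the code from the patch: `Wrapper` is a hypothetical stand-in for Hadoop's ArrayWritable so that the example runs without Hive or Hadoop on the classpath, and `unwrapList` is an invented name. It also assumes that a LIST field always arrives as one outer wrapper whose single child holds the actual elements, which is my reading of the extra "_tuple" layer and may not match the real converter output.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ListUnwrapSketch {

    /** Hypothetical stand-in for ArrayWritable: a plain holder of child values. */
    static final class Wrapper {
        final Object[] values;

        Wrapper(Object... values) {
            this.values = values;
        }
    }

    /**
     * Returns the elements of a LIST field, dropping the one extra wrapper
     * layer assumed to come from the "_tuple" group.
     */
    static List<Object> unwrapList(Wrapper listField) {
        if (listField == null) {
            return null; // the field itself is null
        }
        if (listField.values.length == 0) {
            return Collections.emptyList(); // no inner layer, so no elements
        }
        Object inner = listField.values[0];
        if (!(inner instanceof Wrapper)) {
            // The assumption about the extra layer does not hold for this value.
            throw new IllegalStateException("Expected a nested wrapper for a LIST field");
        }
        return Arrays.asList(((Wrapper) inner).values);
    }

    public static void main(String[] args) {
        Wrapper elements = new Wrapper("r1", "r2"); // the actual list elements
        Wrapper listField = new Wrapper(elements);  // extra layer around them
        System.out.println(unwrapList(listField));  // prints [r1, r2]
    }
}
```

The unwrap is unconditional on purpose: guessing from the shape of the data whether the extra layer is present would be ambiguous for a one-element list of structs, so the sketch fails loudly when the assumed layout does not hold.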