Re: Hive Parquet Reader and "repeated" field

Jean-Pascal Billaud Tue, 11 Nov 2014 15:44:15 -0800

Hey Ryan,

I take it, then, that parquet-thrift structures using lists/sets/maps are not
supported by Hive as of today.


Regarding the patch that I posted: since I need to make it work for my
deployment regardless, does the approach make sense so far? I still need to
hack the ObjectInspector so that once Hive encounters a LIST field (from
Hive's standpoint), the ObjectInspector removes the one unnecessary layer of
ArrayWritable coming from the extra "_tuple" field. Does that make sense?
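
To make the idea concrete, here is a rough, self-contained sketch of the
unwrapping step. It is only an illustration under assumptions: plain Object[]
stands in for ArrayWritable, the names ListUnwrapSketch and unwrapOneLayer are
made up and are not Hive APIs, and the exact nesting produced by the
converters may differ from what is modeled here.

```java
import java.util.Arrays;

// Hypothetical sketch, not Hive code: Object[] stands in for ArrayWritable.
final class ListUnwrapSketch {

    /**
     * Removes one wrapper layer from a list value, assuming the wrapper is a
     * single-entry array whose only entry holds the actual elements.
     * Values that do not have that shape are returned unchanged.
     */
    static Object[] unwrapOneLayer(Object[] listValue) {
        if (listValue != null
                && listValue.length == 1
                && listValue[0] instanceof Object[]) {
            return (Object[]) listValue[0];
        }
        return listValue;
    }

    public static void main(String[] args) {
        // One extra layer, as assumed for the "_tuple" wrapper.
        Object[] wrapped = new Object[] { new Object[] { "a", "b", "c" } };
        System.out.println(Arrays.toString(unwrapOneLayer(wrapped)));
    }
}
```

Deciding by shape alone is ambiguous (a genuine one-element list of lists
would also be unwrapped), so a real ObjectInspector would presumably have to
decide from the schema rather than from the value.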

Thanks,

On Tue, Nov 11, 2014 at 3:14 PM, Ryan Blue <b...@cloudera.com> wrote:

> On 11/11/2014 01:07 PM, Jean-Pascal Billaud wrote:
>
>> While running "select * from parquet_requests", the whole thing crashes
>> with the following exception:
>>
>>    > public ArrayWritableGroupConverter(final GroupType groupType,
>>    >     final HiveGroupConverter parent, final int index) {
>>    >   this.parent = parent;
>>    >   this.index = index;
>>    >   int count = groupType.getFieldCount();
>>    >   if (count < 1 || count > 2) {
>>    >     throw new IllegalStateException(
>>    >         "Field count must be either 1 or 2: " + count);
>>    >   }
>>    >
>>
>> What this means is that requests_tuple is not considered a valid list
>> because it has more than one field. It basically expects the "repeated"
>> keyword on the "requests (LIST)" as opposed to "requests_tuple". The
>> actual code also does not seem to handle repeated on primitives since the
>> ETypeConverters always call parent.set() hence always replacing the
>> previous stored instance.
>>
>> I cooked up a patch which as far as I can tell would fix the issues here
>> and I would like to have some comments to see if that patch is in the
>> right direction before submitting a more formal pull request. Things need
>> to be polished so please don't spend too much time on the form but more
>> on the approach.
>>
>> https://github.com/jpbillaud/hive/commit/4c1de69b0c484903d663b920c1bfbdf8cd9b920d
>>
>> Moreover, I have a feeling that I should probably not pass the thrift
>> class for the parquet table given that at this point it is totally
>> irrelevant and the parquet schema is stored in the parquet files. I also
>> expect some ObjectInspector issue due to the extra grouping provided by
>> the requests_tuple entry. Thoughts?
>>
>> Thanks,
>>
>>
> Hi Jean-Pascal,
>
> This is a known issue that we're going to be fixing shortly. The problem
> is that there's a difference in the way Hive and Thrift (or Avro)
> represent lists. PARQUET-113 [1] is an effort to define what is currently
> being written and what we need to do to add the compatibility. It also
> specifies what should be written.
>
> Hive is one of the first object models that will be updated with the
> backward-compatibility rules so that it can read parquet-avro and
> parquet-thrift structures correctly.
>
> rb
>
> [1]: https://issues.apache.org/jira/browse/PARQUET-113
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
