How to handle null values in an array and keep the right size of this field?

Mickaël Lacour Fri, 03 Oct 2014 07:55:38 -0700

Hello all :)


I'm working on this issue: https://issues.apache.org/jira/browse/HIVE-6994

My dataset is very simple: 3 columns. Here is the schema (hadoop-tools schema):


message hive_schema {
  optional int32 id;
  optional group lstint (LIST) {
    repeated group bag {
      optional int32 array_element;
    }
  }
  optional group lststr (LIST) {
    repeated group bag {
      optional binary array_element;
    }
  }
}

And the content (hadoop-tools cat):

id = 2
lstint:
.bag:
..array_element = 7
lststr:
.bag:
..array_element = e
.bag:
..array_element = e

And the original data that I wanted to write ("|" is the column delimiter, and 
"," is the element delimiter inside an array):

1|7,|e,e
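
To make the intended values of this row explicit, here is a minimal Python
sketch that splits it with the two delimiters described above. The function
name parse_row is hypothetical, and treating an empty element as null is an
assumption based on how I describe the data in this mail.

```python
def parse_row(line):
    """Split one row on "|" (columns) and "," (array elements).

    Assumption: an empty array element stands for a null value.
    """
    id_text, lstint_text, lststr_text = line.split("|")
    return {
        "id": int(id_text),
        "lstint": [int(x) if x else None for x in lstint_text.split(",")],
        "lststr": [x if x else None for x in lststr_text.split(",")],
    }


row = parse_row("1|7,|e,e")
print(row["lstint"])       # [7, None]
print(len(row["lstint"]))  # 2
print(row["lststr"])       # ['e', 'e']
```

So the first array is expected to hold two elements, one of them null.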

Here is my issue: the size of my first array (lstint) should be 2, but Parquet 
keeps only one element (the other is null), so for Parquet the size of my 
array is 1.
I want to keep this information and I don't know how to do it. Basically, I 
cannot ask my recordConsumer to startField if I have no value to add. If I do, 
I get this error when I ask the recordConsumer to endField:

throw new ParquetEncodingException("empty fields are illegal, the field should 
be ommitted completely instead");

So I can't do this, and I don't have any method in the recordConsumer to add 
an empty field inside a "column". Of course, if my whole array is null, Parquet 
adds the null field for this missing column.
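
The two options I see can be sketched with a toy model in Python. This is not
the Parquet API: ToyRecordConsumer, EmptyFieldError and write_array are
hypothetical names, and the only behavior modeled is the check described
above (a started field that is ended without any value is rejected).

```python
class EmptyFieldError(Exception):
    """Stand-in for ParquetEncodingException in this toy model."""


class ToyRecordConsumer:
    """Hypothetical model, not the Parquet API: a started field must
    receive at least one value before it is ended."""

    def __init__(self):
        self.written = []     # values accepted so far
        self._pending = None  # values added since the last start_field

    def start_field(self, name):
        self._pending = []

    def add_value(self, value):
        self._pending.append(value)

    def end_field(self, name):
        if not self._pending:
            raise EmptyFieldError(
                "empty fields are illegal, the field should be ommitted "
                "completely instead"
            )
        self.written.extend(self._pending)
        self._pending = None


def write_array(consumer, elements, skip_nulls):
    """Write each element as its own field; nulls are skipped or left empty."""
    for element in elements:
        if element is None and skip_nulls:
            continue  # the null disappears, so the array shrinks
        consumer.start_field("array_element")
        if element is not None:
            consumer.add_value(element)
        consumer.end_field("array_element")  # raises when nothing was added
    return consumer.written


print(write_array(ToyRecordConsumer(), [7, None], skip_nulls=True))  # [7]
try:
    write_array(ToyRecordConsumer(), [7, None], skip_nulls=False)
except EmptyFieldError as error:
    print(error)
```

Skipping the null loses one element (size 1 instead of 2), while starting and
ending the field without a value triggers the error.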

I have another, related issue: I cannot write an array with only null fields 
(|,,,|); I get the same exception.

Any advice? (Should we add a new method to allow empty fields?)

@Julien: I'm adding you in CC because I didn't see the last mail I sent to the 
mailing list. Can you forward it in case I don't have the right permission? 
Thx!

--

Mickaël Lacour

Senior Software Engineer

Analytics Infrastructure team @Scalability

Criteo
