Hi,

I can implement an addNull method in the RecordConsumer (public void
addNull()). But if I do this, I have an issue when reading the value back. This
is expected, because I'm trying to read an INT where I hit an EOF (I had no way
to say: skip it, it's null):

Caused by: parquet.io.ParquetDecodingException: Can't read value in column 
[lstint, bag, array_element] INT32 at value 2 out of 2, 2 out of 2 in 
currentPage. repetition level: 1, definition level: 3
        at 
parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:466)
        at 
parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:368)
        at 
parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:400)
        at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:173)
        ... 23 more
Caused by: parquet.io.ParquetDecodingException: could not read int
[...]
Caused by: java.io.EOFException
        at 
parquet.bytes.LittleEndianDataInputStream.readInt(LittleEndianDataInputStream.java:352)

The thing is, how am I supposed to read a non-existing value? Do you think we
could add this feature (having null values inside an array)?
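For what it's worth, the information that would let the reader skip the null
slot is exactly a definition level below the maximum. Here is a minimal sketch
in plain Java (DremelListSketch is a made-up name, not the Parquet API) of how
Dremel-style repetition/definition levels could carry a null element for the
lstint schema below, assuming max definition level 3 and max repetition level
1, matching the levels shown in the stack trace:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DremelListSketch {
    // One encoded entry for the array_element column:
    // (repetition level, definition level, value or null).
    record Entry(int rep, int def, Integer value) {}

    // Encode one record's lstint list. For this schema the max definition
    // level is 3 (optional lstint -> repeated bag -> optional array_element)
    // and the max repetition level is 1.
    static List<Entry> encode(List<Integer> list) {
        List<Entry> out = new ArrayList<>();
        for (int i = 0; i < list.size(); i++) {
            int rep = (i == 0) ? 0 : 1;     // 0 starts the record, 1 repeats bag
            Integer v = list.get(i);
            int def = (v == null) ? 2 : 3;  // 2: bag present but element null
            out.add(new Entry(rep, def, v));
        }
        return out;
    }

    // Decode: a null slot is recovered from its definition level alone,
    // so no value is read from the data stream and the list keeps its size.
    static List<Integer> decode(List<Entry> entries) {
        List<Integer> out = new ArrayList<>();
        for (Entry e : entries) out.add(e.def() == 3 ? e.value() : null);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> original = Arrays.asList(7, null);
        List<Entry> encoded = encode(original);
        System.out.println(encoded.size());   // prints 2: the null slot is kept
        System.out.println(decode(encoded));  // prints [7, null]
    }
}
```

With an encoding like this, the list size survives the round trip because the
null element still produces one (rep, def) pair, just with no value attached.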
--
Mickaël Lacour
Senior Software Engineer
Analytics Infrastructure team @Scalability

________________________________________
From: Ryan Blue <[email protected]>
Sent: Friday, October 3, 2014 18:46
To: [email protected]
Cc: Julien Le Dem; Justin Coffey; Remy Pecqueur
Subject: Re: How to handle null values in an array and keeping the right size of
this field?

Does it work to add a null value?

  startField("lstint", 0)
   startField("bag", 0)
    addValue(7)
    addValue(null)
   endField("bag", 0)
  endField("lstint", 0)

rb

On 10/03/2014 07:54 AM, Mickaël Lacour wrote:
> Hello all :)
>
>
> I'm working on this issue : https://issues.apache.org/jira/browse/HIVE-6994
>
> My dataset is very simple: 3 columns. Here is the schema (hadoop-tools schema):
>
>
> message hive_schema {
>    optional int32 id;
>    optional group lstint (LIST) {
>      repeated group bag {
>        optional int32 array_element;
>      }
>    }
>    optional group lststr (LIST) {
>      repeated group bag {
>        optional binary array_element;
>      }
>    }
> }
>
> And the content (hadoop-tools cat)
>
> id = 2
> lstint:
> .bag:
> ..array_element = 7
> lststr:
> .bag:
> ..array_element = e
> .bag:
> ..array_element = e
>
> And the original data that I wanted to write ("|" is the column delimiter,
> and "," is the element delimiter inside an array):
>
> 1|7,|e,e
>
> Here is my issue: the size of my array (the first one, called lstint) should
> be 2, but Parquet is only keeping one field (the other is null), so for
> Parquet the size of my array is 1.
> I want to keep this information, and I don't know how to do it. Basically I
> cannot ask my recordConsumer to startField if I have no value to add; if I
> do, then when I ask the recordConsumer to endField, I get this error:
>
> throw new ParquetEncodingException("empty fields are illegal, the field 
> should be ommitted completely instead");
>
> So I can't do that, and the recordConsumer has no method to add an empty
> field inside a "column". Of course, if my whole array is null, Parquet will
> add the null field for this missing column.
>
> And another issue, related to this one: I cannot write an array containing
> only null fields (|,,,|); I get the same exception.
>
> Any advice? (Should we add a new method to be able to have empty fields?)
>
> @Julien: I'm adding you in CC because I didn't see the last mail I sent to
> the mailing list. Can you forward it in case I don't have the right
> permissions? Thanks!
>
> --
>
> Mickaël Lacour
>
> Senior Software Engineer
>
> Analytics Infrastructure team @Scalability
>
> Criteo


--
Ryan Blue
Software Engineer
Cloudera, Inc.
