Hello all :)
I'm working on this issue: https://issues.apache.org/jira/browse/HIVE-6994

My dataset is very simple: 3 columns. Here is the schema (hadoop-tools schema):

message hive_schema {
  optional int32 id;
  optional group lstint (LIST) {
    repeated group bag {
      optional int32 array_element;
    }
  }
  optional group lststr (LIST) {
    repeated group bag {
      optional binary array_element;
    }
  }
}

And the content (hadoop-tools cat):

id = 2
lstint:
.bag:
..array_element = 7
lststr:
.bag:
..array_element = e
.bag:
..array_element = e

And the original data that I wanted to write ("|" is the column delimiter, and "," is the element delimiter inside an array):

1|7,|e,e

Here is my issue: the size of my first array (lstint) should be 2, because its second element is null, but Parquet only keeps the non-null element. So for Parquet the size of my array is 1. I want to keep this information and I don't know how to do it.

Basically, I cannot ask my recordConsumer to startField if I have no value to add. If I do, then when I ask the recordConsumer to endField I get this error:

throw new ParquetEncodingException("empty fields are illegal, the field should be ommitted completely instead");

So I can't do this, and the recordConsumer has no method to add an empty field inside a "column". Of course, if the whole array is null, Parquet adds the null field for the missing column.

I have another issue related to this one: I cannot write an array that contains only null elements (|,,,|). I get the same exception.

Any advice? Should we add a new method to be able to write empty fields?

@Julien: I'm adding you in CC because I didn't see the last mail I sent to the mailing list. Can you forward it in case I don't have the right permission?

Thx!

-- 
Mickaël Lacour
Senior Software Engineer
Analytics Infrastructure team @Scalability
Criteo
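PS: to make the problem concrete, here is a small self-contained sketch. ToyConsumer is a hypothetical stand-in that I wrote for this mail, not Parquet's real RecordConsumer. It only models the rule from the exception above (a field that is started and ended with nothing in between is rejected) and throws IllegalStateException instead of ParquetEncodingException. The sketch shows the naive write failing, and one idea I have not validated against Parquet itself: always open one "bag" group per element and simply skip array_element when the element is null. Whether the real consumer accepts an empty group is an assumption here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, NOT Parquet's RecordConsumer: it records events and
// rejects a field that is ended without anything written inside it.
class ToyConsumer {
    final List<String> events = new ArrayList<>();
    private boolean emptyField = false;

    void startField(String name) { events.add("startField(" + name + ")"); emptyField = true; }

    void endField(String name) {
        if (emptyField) {
            throw new IllegalStateException(
                "empty fields are illegal, the field should be ommitted completely instead");
        }
        events.add("endField(" + name + ")");
    }

    // Assumption: opening a group counts as content for the enclosing field.
    void startGroup() { events.add("startGroup"); emptyField = false; }
    void endGroup() { events.add("endGroup"); emptyField = false; }
    void addInteger(int v) { events.add("addInteger(" + v + ")"); emptyField = false; }
}

class EmptyFieldSketch {
    // Naive attempt: start and end the element field with no value in it.
    static void writeNullElementNaively(ToyConsumer c) {
        c.startField("array_element");
        c.endField("array_element"); // rejected by the toy consumer
    }

    // Idea: one "bag" group per element; a null element is an empty group.
    // A null or empty list is omitted completely (the two are not distinguished here).
    static void writeIntList(ToyConsumer c, String column, List<Integer> values) {
        if (values == null || values.isEmpty()) {
            return;
        }
        c.startField(column);
        c.startGroup();
        c.startField("bag");
        for (Integer v : values) {
            c.startGroup();
            if (v != null) {
                c.startField("array_element");
                c.addInteger(v);
                c.endField("array_element");
            }
            c.endGroup();
        }
        c.endField("bag");
        c.endGroup();
        c.endField(column);
    }

    public static void main(String[] args) {
        ToyConsumer ok = new ToyConsumer();
        writeIntList(ok, "lstint", Arrays.asList(7, null)); // the "7," column
        System.out.println(ok.events);

        try {
            writeNullElementNaively(new ToyConsumer());
        } catch (IllegalStateException e) {
            System.out.println("naive write rejected: " + e.getMessage());
        }
    }
}
```

With this layout the number of "bag" groups equals the array size, so both "7," and an all-null array such as ",,," would keep their length, provided the real writer and the reader side agree to treat an empty bag as a null element.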
