[jira] [Commented] (PARQUET-459) Improve handling of null values

Deepak Majeti (JIRA) Sat, 23 Jan 2016 08:41:01 -0800

    [ 
https://issues.apache.org/jira/browse/PARQUET-459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113830#comment-15113830
 ]


Deepak Majeti commented on PARQUET-459:
---------------------------------------

I agree that it is better to expose the definition and repetition levels past the 
column readers for nested data. The downstream application should manage 
stitching the collection data based on its internal data structures.
For flat data iteration, I was hoping we could make do with an _array_ _of_ 
_bools_, one per value, indicating whether that value is null, instead of exposing 
repetition/definition levels. This scheme could save us some space as well.
Also, can you read the data values from the data pages without the help of 
definition and repetition levels? I am asking because nulls are not 
stored physically, so you might have to speculate when reading a data page. 
However, if you end up using definition and repetition levels to read the data 
values of flat data, it would be redundant to pass the def/rep values to the top 
level.
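
To make the flat case concrete, here is a minimal sketch of the bool-array idea. It assumes a flat OPTIONAL int32 column (maximum definition level 1) whose definition levels and densely packed values have already been decoded. The names FlatBatch and ExpandFlatColumn are hypothetical and are not part of the parquet-cpp API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: one slot per row plus a parallel null mask.
struct FlatBatch {
  std::vector<int32_t> values;  // values[i] is meaningful only if !is_null[i]
  std::vector<bool> is_null;    // is_null[i] is true when row i is NULL
};

// Expands densely packed values into one slot per row.
// A definition level equal to max_def_level means the value is present;
// anything lower means NULL, and no value was stored for that row.
FlatBatch ExpandFlatColumn(const std::vector<int16_t>& def_levels,
                           const std::vector<int32_t>& packed_values,
                           int16_t max_def_level) {
  FlatBatch out;
  out.values.reserve(def_levels.size());
  out.is_null.reserve(def_levels.size());
  std::size_t next = 0;  // index of the next unread packed value
  for (int16_t level : def_levels) {
    const bool present = (level == max_def_level);
    out.is_null.push_back(!present);
    if (present) {
      assert(next < packed_values.size());
      out.values.push_back(packed_values[next++]);
    } else {
      out.values.push_back(0);  // placeholder; callers must check is_null
    }
  }
  assert(next == packed_values.size());  // every stored value was consumed
  return out;
}
```

For definition levels {1, 0, 1, 1, 0} and packed values {10, 20, 30}, rows 1 and 4 come back flagged as null and the three stored values land in rows 0, 2 and 3. Note that the reader still needs the definition levels internally to know which rows to skip; only the caller is spared from seeing them.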

I was hoping PARQUET-435 would have separate classes for flat data 
("ScalarColumnReader") and nested data ("CollectionColumnReader"). We could provide 
different APIs for flat and nested data ColumnReaders for efficiency.

I don't see the current API exposing the definition levels to infer nulls. 
_definition\_level\_decoder_ is private.
The current API looks only at the definition level, which should work since it 
only supports flat data at the top level. But if we want to infer nulls from 
nested data columns, we will need both repetition and definition levels.
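
As an illustration of why the nested case needs both kinds of levels, here is a rough sketch of stitching an optional list of optional int32 elements. It assumes the standard three-level LIST layout, so the maximum definition level is 3 and the maximum repetition level is 1. The function StitchLists and the Row/Element aliases are hypothetical names, not parquet-cpp API, and the code needs C++17 for std::optional.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using Element = std::optional<int32_t>;           // nullopt = NULL element
using Row = std::optional<std::vector<Element>>;  // nullopt = NULL list

// Definition level meaning under the assumed schema:
//   0 = list is NULL, 1 = list is empty,
//   2 = element is NULL, 3 = element is present (a value was stored).
// Repetition level meaning: 0 = starts a new row, 1 = continues the list.
std::vector<Row> StitchLists(const std::vector<int16_t>& rep_levels,
                             const std::vector<int16_t>& def_levels,
                             const std::vector<int32_t>& packed_values) {
  assert(rep_levels.size() == def_levels.size());
  std::vector<Row> rows;
  std::size_t next = 0;  // index of the next unread packed value
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    const int16_t rep = rep_levels[i];
    const int16_t def = def_levels[i];
    if (rep == 0) {
      if (def == 0) {  // the whole list is NULL
        rows.push_back(std::nullopt);
        continue;
      }
      rows.push_back(std::vector<Element>{});
      if (def == 1) continue;  // list exists but has no elements
    }
    // From here on the entry describes one element of the current list.
    assert(!rows.empty() && rows.back().has_value());
    assert(def >= 2);
    if (def == 3) {
      assert(next < packed_values.size());
      rows.back()->push_back(packed_values[next++]);
    } else {
      rows.back()->push_back(std::nullopt);  // NULL element, nothing stored
    }
  }
  return rows;
}
```

With repetition levels {0, 1, 1, 0, 0, 0}, definition levels {3, 2, 3, 0, 1, 3} and packed values {1, 2, 3}, this yields four rows: [1, NULL, 2], NULL, [] and [3]. The definition level alone separates a null list, an empty list and a null element, while the repetition level is what marks the row boundaries.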

I am waiting for pull request 22 to be merged. I want to pull in PARQUET-428 as well.

> Improve handling of null values
> -------------------------------
>
>                 Key: PARQUET-459
>                 URL: https://issues.apache.org/jira/browse/PARQUET-459
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Deepak Majeti
>
> Currently, the default value of the type is returned for NULL values and is 
> incorrect.
> This JIRA will correctly identify a NULL value with the help of an additional 
> variable that will be set for NULL values. 
> This feature depends on reading the repetition level (PARQUET-169).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
