[
https://issues.apache.org/jira/browse/PARQUET-459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113830#comment-15113830
]
Deepak Majeti commented on PARQUET-459:
---------------------------------------
I agree that it is better to expose the definition and repetition levels past the
column readers for nested data. The downstream application should manage
stitching the collection data based on its internal data structures.
For flat data iteration, I was hoping we could get by with an _array_ _of_
_bools_, one per value, marking whether that value is null, instead of exposing
repetition/definition levels. This scheme could save us some space as well.
Also, can you read the data values from the data pages without the help of
definition and repetition levels? I am asking because nulls are not stored
physically, so you might have to speculate when reading a data page.
However, if you end up using definition and repetition levels to read data
values of flat data, it will be redundant to pass def/rep values to the top
level.
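To illustrate the bool-array idea for flat data, here is a minimal sketch. It
assumes a flat optional column whose maximum definition level is 1, so a slot
is null exactly when its definition level is below that maximum. The helper
names NullMaskFromDefLevels and SpreadValues are hypothetical and are not part
of parquet-cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of parquet-cpp.
// For a flat (non-repeated) column, a slot is null exactly when its
// definition level is below the column's maximum definition level.
std::vector<bool> NullMaskFromDefLevels(const std::vector<int16_t>& def_levels,
                                        int16_t max_def_level) {
  std::vector<bool> is_null(def_levels.size());
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    is_null[i] = def_levels[i] < max_def_level;
  }
  return is_null;
}

// Hypothetical helper, not part of parquet-cpp.
// The data page stores only the non-null values, packed densely. This spreads
// them over one slot per row; null slots hold T() and must be ignored by the
// caller, who consults the mask instead.
template <typename T>
std::vector<T> SpreadValues(const std::vector<T>& packed,
                            const std::vector<bool>& is_null) {
  std::vector<T> out(is_null.size(), T());
  std::size_t next = 0;  // index of the next unread packed value
  for (std::size_t i = 0; i < is_null.size(); ++i) {
    if (!is_null[i]) {
      assert(next < packed.size());
      out[i] = packed[next++];
    }
  }
  assert(next == packed.size());  // every stored value must be consumed
  return out;
}
```

With this shape the definition levels are consumed inside the reader, and the
caller only sees the values plus one bool per slot.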
I was hoping PARQUET-435 would have different classes for flat data
("ScalarColumnReader") and nested data ("CollectionColumnReader"). We could
provide different APIs for flat and nested data ColumnReaders for efficiency.
I don't see the current API exposing the definition values to infer nulls.
_definition\_level\_decoder_ is private.
The current API looks only at the definition level, which should work since it
only supports flat data at the top level. But if we want to infer nulls from
nested data columns, we will need both repetition and definition levels.
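As a rough sketch of why nested data needs both level streams, consider a
column assumed to have the standard three-level list layout
optional group xs (LIST) { repeated group list { optional int32 element; } },
with maximum definition level 3 and maximum repetition level 1. The
RowSummary struct and SummarizeRows function below are hypothetical and only
illustrate the interpretation of the levels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-row summary, not part of parquet-cpp.
struct RowSummary {
  bool list_is_null = false;
  int num_elements = 0;       // element slots, including null elements
  int num_null_elements = 0;
};

// Interprets one (definition, repetition) level pair per entry.
// Assumed meaning of the definition levels for the schema above:
//   0 = the list itself is null      1 = the list is present but empty
//   2 = the element is null          3 = the element is present
std::vector<RowSummary> SummarizeRows(const std::vector<int16_t>& def_levels,
                                      const std::vector<int16_t>& rep_levels) {
  assert(def_levels.size() == rep_levels.size());
  std::vector<RowSummary> rows;
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (rep_levels[i] == 0) rows.emplace_back();  // level 0 starts a new row
    assert(!rows.empty());
    RowSummary& row = rows.back();
    switch (def_levels[i]) {
      case 0: row.list_is_null = true; break;
      case 1: break;
      case 2: ++row.num_elements; ++row.num_null_elements; break;
      case 3: ++row.num_elements; break;
      default: assert(false);  // level outside the assumed schema
    }
  }
  return rows;
}
```

The definition levels alone say which kind of null each entry is, but not
which row it belongs to: the sequence 3, 2, 3 could be one three-element list
or three single-element lists until the repetition levels are consulted.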
I am waiting for pull request 22 to be merged. I want to pull in PARQUET-428 as well.
> Improve handling of null values
> -------------------------------
>
> Key: PARQUET-459
> URL: https://issues.apache.org/jira/browse/PARQUET-459
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Deepak Majeti
>
> Currently, the default value of the type is returned for NULL values and is
> incorrect.
> This JIRA will correctly identify a NULL value with the help of an additional
> variable that will be set for NULL values.
> This feature depends on reading the repetition level (PARQUET-169).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)