> I see that OPTIONAL or REPEATED must be specified as the Repetition type
> for columns where def level of 0 indicates NULL and 1 means not NULL.  The
> SchemaDescriptor::BuildTree method at
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
> shows how this causes max_def_level to increment.
That seems right: if your data doesn't have any complex types in it,
max_def_level will always be 0 or 1 depending on whether the column is
REQUIRED/OPTIONAL. One option, depending on your data model, is to always
just mark the field as OPTIONAL and provide the def levels. If they're all
1 they will compress extremely well. Impala actually does this, because
most columns end up being potentially nullable in the Impala/Hive data
model.

> We are using standard Parquet APIs via C++/libparquet.so and therefore
> not doing our own Parquet file-format writer/reader.
Ok, great! I'm not so familiar with the parquet-cpp APIs, but I took a
quick look and it does look like it exposes the concept of rep/def levels.
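For example, the typed column writers take the level arrays directly in
WriteBatch; roughly (the helper and variable names are mine, assuming the
flat OPTIONAL double column from the sketch above):

#include <cstdint>
#include <vector>
#include <parquet/column_writer.h>
#include <parquet/file_writer.h>

// Write one chunk of an OPTIONAL double column. def_levels has one entry
// per row (1 = present, 0 = NULL); values holds only the non-NULL doubles.
// rep_levels can be nullptr because the schema is flat (no repetition).
void WriteDoubleColumn(parquet::RowGroupWriter* rg_writer,
                       const std::vector<int16_t>& def_levels,
                       const std::vector<double>& values) {
  auto* col = static_cast<parquet::DoubleWriter*>(rg_writer->NextColumn());
  col->WriteBatch(static_cast<int64_t>(def_levels.size()),
                  def_levels.data(), /*rep_levels=*/nullptr, values.data());
}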

> NaNs representing missing values occur frequently in a myriad of SAS use
> cases.  Other data types may be NULL as well, so I'm wondering if using def
> level to indicate NULLs is safer (with consideration to other readers) and
> also consumes less memory/storage across the spectrum of Parquet-supported
> data types?
If I were in your situation, this is what I'd probably do. We've seen a
lot more inconsistency between readers in how they handle NaN than in how
they handle NULL.
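Concretely, that means treating your NaN sentinel as NULL at write time:
split each buffer into def levels plus a dense value array before calling
WriteBatch. A rough sketch (the helper name and the NaN-means-missing
convention come from your description, not from anything in parquet-cpp):

#include <cmath>
#include <cstdint>
#include <vector>

// Convert a buffer where NaN means "missing" into the (def_levels, values)
// pair an OPTIONAL column expects: def level 1 for present values, 0 for
// NULLs, and only the present values kept in the dense value array.
void NanToDefLevels(const std::vector<double>& raw,
                    std::vector<int16_t>* def_levels,
                    std::vector<double>* values) {
  def_levels->clear();
  values->clear();
  for (double v : raw) {
    if (std::isnan(v)) {
      def_levels->push_back(0);  // missing -> NULL, no stored value
    } else {
      def_levels->push_back(1);  // present
      values->push_back(v);
    }
  }
}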

On Mon, May 13, 2019 at 10:49 AM Brian Bowman <[email protected]> wrote:

> Tim,
>
> Thanks for your detailed reply and especially for pointing out the RLE
> encoding for the def level!
>
> Your comment:
>
>     <<- If the field is required, the max def level is 0, therefore all
> values
>        are 0, therefore the def levels can be "decoded" from nothing and
> the def
>        levels can be omitted for the page.>>
>
> I see that OPTIONAL or REPEATED must be specified as the Repetition type
> for columns where def level of 0 indicates NULL and 1 means not NULL.  The
> SchemaDescriptor::BuildTree method at
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
> shows how this causes max_def_level to increment.
>
> We are using standard Parquet APIs via C++/libparquet.so and therefore
> not doing our own Parquet file-format writer/reader.
>
> NaNs representing missing values occur frequently in a myriad of SAS use
> cases.  Other data types may be NULL as well, so I'm wondering if using def
> level to indicate NULLs is safer (with consideration to other readers) and
> also consumes less memory/storage across the spectrum of Parquet-supported
> data types?
>
> Best,
>
> Brian
>
>
> On 5/13/19, 1:03 PM, "Tim Armstrong" <[email protected]>
> wrote:
>
>
>     Parquet float/double values can hold any IEEE floating point value -
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413
> .
>     So there's no reason you can't write NaN to the files. If a reader
> isn't
>     handling NaN values correctly, that seems like an issue with that
> reader,
>     although I think you're correct in that you're more likely to hit
> reader
>     bugs with NaN than NULL. (I may be telling you something you already
> know,
>     but thought I'd start with that).
>
>     I don't think the Parquet format is opinionated about what NULL vs NaN
>     means, although I'd assume that NULL means that the data simply wasn't
>     present, and NaN means that it was the result of a floating point
>     calculation that resulted in NaN.
>
>     The rep/definition level encoding is fairly complex because of the
> handling
>     of nested types and the various ways of encoding the sequence of
> levels.
>     The way I'd think about it is:
>
>        - If you don't have any complex/nested types, rep levels aren't
> needed
>        and the logical def levels degenerate into 1=not null, 0 = null.
>        - The RLE encoding has a bit-width implied by the max def level
> value -
>        if the max-level is 1, 1 bit is needed per value. If it is 0, 0
> bits are
>        needed per value.
>        - If the field is required, the max def level is 0, therefore all
> values
>        are 0, therefore the def levels can be "decoded" from nothing and
> the def
>        levels can be omitted for the page.
>        - If the field is nullable, the bit width is 1, therefore each def
> level
>        is logically a bit. However, RLE encoding is applied to the
> sequence of 1/0
>        levels -
>        https://github.com/apache/parquet-format/blob/master/Encodings.md
>
>     The last point is where I think your understanding might diverge from
> the
>     implementation - the encoded def levels are not simply a bit vector,
> they're a
>     more complex hybrid RLE/bit-packed encoding.
>
>     If you use one of the existing Parquet libraries it will handle all
> this
>     for you - it's a headache to get it all right from scratch.
>     - Tim
>
>
>     On Mon, May 13, 2019 at 8:43 AM Brian Bowman <[email protected]>
> wrote:
>
>     > All,
>     >
>     > I’m working to integrate the historic usage of SAS missing values
> for IEEE
>     > doubles into our SAS Viya Parquet integration.  SAS writes a NAN to
>     > represent floating-point doubles that are “missing,” i.e. NULL in
> more
>     > general data management terms.
>     >
>     > Of course SAS’ goal is to create .parquet files that are universally
>     > readable.  Therefore, it appears that the SAS Parquet writer(s) will
> NOT be
>     > able to write the usual NAN to represent “missing,” because doing so
> will
>     > cause a floating point exception for other readers.
>     >
>     > Based on the Parquet doc at:
>     > https://parquet.apache.org/documentation/latest/ and by examining
> code, I
>     > understand that Parquet NULL values are indicated by setting 0x000
> at the
>     > definition level vector offset corresponding to each NULL column
> offset
>     > value.
>     >
>     > Conversely, it appears that the per-column, per-page definition
> level data
>     > is never written when required is not specified for the column
> schema.
>     >
>     > Is my understanding and Parquet terminology correct here?
>     >
>     > Thanks,
>     >
>     > Brian
>     >
>
>
>
