Tim,

Thanks for your detailed reply and especially for pointing out the RLE encoding for 
the def level!

Your comment:         

    <<- If the field is required, the max def level is 0, therefore all values
       are 0, therefore the def levels can be "decoded" from nothing and the def
       levels can be omitted for the page.>>

I see that OPTIONAL or REPEATED must be specified as the Repetition type for 
columns where def level of 0 indicates NULL and 1 means not NULL.  The 
SchemaDescriptor::BuildTree method at 
https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
shows how this causes max_def_level to increment. 
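
As I read it, the level bookkeeping amounts to something like the sketch below.
This is a standalone illustration, not the parquet-cpp code: the Repetition
enum, ComputeMaxLevels, and LevelBitWidth are names made up here, and the path
vector stands for the repetition types of the nodes from just below the schema
root down to the leaf column.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the Parquet repetition types.
enum class Repetition { REQUIRED, OPTIONAL, REPEATED };

struct Levels {
  int16_t max_def_level = 0;
  int16_t max_rep_level = 0;
};

// Walk the schema path down to a leaf column and accumulate the levels.
Levels ComputeMaxLevels(const std::vector<Repetition>& path) {
  Levels levels;
  for (Repetition r : path) {
    // Every OPTIONAL or REPEATED node adds one definition level.
    if (r != Repetition::REQUIRED) ++levels.max_def_level;
    // Only REPEATED nodes add a repetition level.
    if (r == Repetition::REPEATED) ++levels.max_rep_level;
  }
  return levels;
}

// Number of bits needed to store any level in the range [0, max_level].
int LevelBitWidth(int16_t max_level) {
  int bits = 0;
  while ((1 << bits) <= max_level) ++bits;
  return bits;
}
```

For a flat OPTIONAL double column the path is just {OPTIONAL}, which gives
max_def_level = 1 and a 1-bit def level; a REQUIRED column gives
max_def_level = 0 and a bit width of 0.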

We are using the standard Parquet APIs via C++/libparquet.so and are therefore not 
writing our own Parquet file-format writer/reader.
 
NaNs representing missing values occur frequently across a myriad of SAS use cases.
Other data types may be NULL as well, so I'm wondering whether using the def level to 
indicate NULLs is safer (with consideration for other readers) and also consumes 
less memory/storage across the spectrum of Parquet-supported data types?
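
To make the question concrete, here is a minimal sketch of what I have in mind
for a flat OPTIONAL double column: one def level per row, plus only the
non-missing values. NullableColumn and SplitMissing are hypothetical names for
illustration. My understanding is that this is the shape the column writer's
WriteBatch call takes (def levels plus densely packed non-NULL values), but
treat that as an assumption to check against the headers.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical holder for one flat OPTIONAL double column.
struct NullableColumn {
  std::vector<int16_t> def_levels;  // one entry per row: 1 = present, 0 = NULL
  std::vector<double> values;       // only the non-NULL values, in row order
};

// Treat every NaN in the input as "missing" and turn it into a NULL.
// Note: this also drops NaNs that came from a genuine calculation.
NullableColumn SplitMissing(const std::vector<double>& rows) {
  NullableColumn out;
  out.def_levels.reserve(rows.size());
  for (double v : rows) {
    if (std::isnan(v)) {
      out.def_levels.push_back(0);  // NULL: no value is stored for this row
    } else {
      out.def_levels.push_back(1);
      out.values.push_back(v);
    }
  }
  return out;
}
```

With this layout a missing row costs one def level and no 8-byte value slot,
before any encoding or compression is applied.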

Best,

Brian


On 5/13/19, 1:03 PM, "Tim Armstrong" <[email protected]> wrote:

    Parquet float/double values can hold any IEEE floating point value -
    https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413.
    So there's no reason you can't write NaN to the files. If a reader isn't
    handling NaN values correctly, that seems like an issue with that reader,
    although I think you're correct in that you're more likely to hit reader
    bugs with NaN than NULL. (I may be telling you something you already know,
    but thought I'd start with that).
    
    I don't think the Parquet format is opinionated about what NULL vs NaN
    means, although I'd assume that NULL means that the data simply wasn't
    present, and NaN means that it was the result of a floating point
    calculation that resulted in NaN.
    
    The rep/definition level encoding is fairly complex because of the handling
    of nested types and the various ways of encoding the sequence of levels.
    The way I'd think about it is:
    
       - If you don't have any complex/nested types, rep levels aren't needed
       and the logical def levels degenerate into 1=not null, 0 = null.
       - The RLE encoding has a bit-width implied by the max def level value -
       if the max-level is 1, 1 bit is needed per value. If it is 0, 0 bits are
       needed per value.
       - If the field is required, the max def level is 0, therefore all values
       are 0, therefore the def levels can be "decoded" from nothing and the def
       levels can be omitted for the page.
       - If the field is nullable, the bit width is 1, therefore each def level
       is logically a bit. However, RLE encoding is applied to the sequence of
       1/0 levels -
       https://github.com/apache/parquet-format/blob/master/Encodings.md
    
    The last point is where I think your understanding might diverge from the
    implementation - the encoded def levels are not simply a bit vector, it's a
    more complex hybrid RLE/bit-packed encoding.
    
    If you use one of the existing Parquet libraries it will handle all this
    for you - it's a headache to get it all right from scratch.
    - Tim
    
    
    On Mon, May 13, 2019 at 8:43 AM Brian Bowman <[email protected]> wrote:
    
    > All,
    >
    > I’m working to integrate the historic usage of SAS missing values for IEEE
    > doubles into our SAS Viya Parquet integration.  SAS writes a NAN to
    > represent floating-point doubles that are “missing,” i.e. NULL in more
    > general data management terms.
    >
    > Of course SAS’ goal is to create .parquet files that are universally
    > readable.  Therefore, it appears that the SAS Parquet writer(s) will NOT be
    > able to write the usual NAN to represent “missing,” because doing so will
    > cause a floating point exception for other readers.
    >
    > Based on the Parquet doc at:
    > https://parquet.apache.org/documentation/latest/ and by examining code, I
    > understand that Parquet NULL values are indicated by setting 0x000 at the
    > definition level vector offset corresponding to each NULL column offset
    > value.
    >
    > Conversely, it appears that the per-column, per-page definition level data
    > is never written when required is not specified for the column schema.
    >
    > Is my understanding and Parquet terminology correct here?
    >
    > Thanks,
    >
    > Brian
    >
    
