Thanks Wes,

We are using the Parquet C++ low-level APIs. 

Our Parquet "adapter" code will translate the SAS "missing" NaN representation 
to the correct position in the int16_t def level vector passed to the Parquet 
low-level writer.   Similarly, this adapter will reconstitute the NaN "missing" 
representation from the def level vector returned from LevelDecoder() at 
https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_reader.cc#L77
 up through ReadBatch() and ultimately back to SAS.

-Brian

On 5/13/19, 2:48 PM, "Wes McKinney" <[email protected]> wrote:

    To comment from the Parquet C++ side, we expose two writer APIs:
    
    * High level, using Apache Arrow -- use Arrow's bitmap-based
    null/valid representation for null values, NaN is NaN
    * Low level, you produce your own repetition/definition levels
    
    So if you're using the low level API, and you have values like
    
    [1, 2, 3, NULL = NaN, 5]
    
    then you could represent this as
    
    def_levels = [1, 1, 1, 0, 1]
    rep_levels = nullptr
    values = [1, 2, 3, 5]
    
    If you don't use the definition level encoding of nulls then other
    readers will presume the values to be non-null.
    
    On Mon, May 13, 2019 at 1:06 PM Tim Armstrong
    <[email protected]> wrote:
    >
    > > I see that OPTIONAL or REPEATED must be specified as the Repetition type
    > for columns where def level of 0 indicates NULL and 1 means not NULL.  The
    > SchemaDescriptor::BuildTree method at
    > https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
    > shows how this causes max_def_level to increment.
    > That seems right. If your data doesn't have any complex types in it,
    > max_def_level will always be 0 or 1 depending on whether the column is
    > REQUIRED/OPTIONAL. One option, depending on your data model, is to always
    > just mark the field as OPTIONAL and provide the def levels. If they're all
    > 1 they will compress extremely well. Impala actually does this because
    > most columns end up being potentially nullable in the Impala/Hive data
    > model.
    >
    > > We are using standard Parquet APIs via C++/libparquet.so and therefore
    > not doing our own Parquet file-format writer/reader.
    > Ok, great! I'm not so familiar with the parquet-cpp APIs but I took a
    > quick look and I guess it does expose the concept of rep/def levels.
    >
    > > NaNs representing missing values occur frequently in a myriad of SAS use
    > cases.  Other data types may be NULL as well, so I'm wondering if using
    > def level to indicate NULLs is safer (with consideration to other readers)
    > and also consumes less memory/storage across the spectrum of
    > Parquet-supported data types?
    > If I was in your situation, this is what I'd probably do. We've seen a lot
    > more inconsistency with handling of NaN between readers.
    >
    > On Mon, May 13, 2019 at 10:49 AM Brian Bowman <[email protected]> wrote:
    >
    > > Tim,
    > >
    > > Thanks for your detailed reply and especially for pointing out the RLE
    > > encoding for the def level!
    > >
    > > Your comment:
    > >
    > >     <<- If the field is required, the max def level is 0, therefore all
    > >        values are 0, therefore the def levels can be "decoded" from
    > >        nothing and the def levels can be omitted for the page.>>
    > >
    > > I see that OPTIONAL or REPEATED must be specified as the Repetition type
    > > for columns where def level of 0 indicates NULL and 1 means not NULL.  The
    > > SchemaDescriptor::BuildTree method at
    > > https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
    > > shows how this causes max_def_level to increment.
    > >
    > > We are using standard Parquet APIs via C++/libparquet.so and therefore
    > > not doing our own Parquet file-format writer/reader.
    > >
    > > NaNs representing missing values occur frequently in a myriad of SAS use
    > > cases.  Other data types may be NULL as well, so I'm wondering if using
    > > def level to indicate NULLs is safer (with consideration to other readers)
    > > and also consumes less memory/storage across the spectrum of
    > > Parquet-supported data types?
    > >
    > > Best,
    > >
    > > Brian
    > >
    > >
    > > On 5/13/19, 1:03 PM, "Tim Armstrong" <[email protected]>
    > > wrote:
    > >
    > >     Parquet float/double values can hold any IEEE floating point value -
    > >
    > >     https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413
    > >     So there's no reason you can't write NaN to the files. If a reader
    > >     isn't handling NaN values correctly, that seems like an issue with
    > >     that reader, although I think you're correct in that you're more
    > >     likely to hit reader bugs with NaN than NULL. (I may be telling you
    > >     something you already know, but thought I'd start with that).
    > >
    > >     I don't think the Parquet format is opinionated about what NULL vs
    > >     NaN means, although I'd assume that NULL means that the data simply
    > >     wasn't present, and NaN means that it was the result of a floating
    > >     point calculation that resulted in NaN.
    > >
    > >     The rep/definition level encoding is fairly complex because of the
    > >     handling of nested types and the various ways of encoding the
    > >     sequence of levels. The way I'd think about it is:
    > >
    > >        - If you don't have any complex/nested types, rep levels aren't
    > >        needed and the logical def levels degenerate into 1 = not null,
    > >        0 = null.
    > >        - The RLE encoding has a bit-width implied by the max def level
    > >        value - if the max level is 1, 1 bit is needed per value. If it
    > >        is 0, 0 bits are needed per value.
    > >        - If the field is required, the max def level is 0, therefore all
    > >        values are 0, therefore the def levels can be "decoded" from
    > >        nothing and the def levels can be omitted for the page.
    > >        - If the field is nullable, the bit width is 1, therefore each
    > >        def level is logically a bit. However, RLE encoding is applied to
    > >        the sequence of 1/0 levels -
    > >        https://github.com/apache/parquet-format/blob/master/Encodings.md
    > >
    > >     The last point is where I think your understanding might diverge
    > >     from the implementation - the encoded def levels are not simply a
    > >     bit vector; it's a more complex hybrid RLE/bit-packed encoding.
    > >
    > >     If you use one of the existing Parquet libraries it will handle all
    > >     of this for you - it's a headache to get it all right from scratch.
    > >     - Tim
    > >
    > >
    > >     On Mon, May 13, 2019 at 8:43 AM Brian Bowman <[email protected]>
    > > wrote:
    > >
    > >     > All,
    > >     >
    > >     > I’m working to integrate the historic usage of SAS missing values
    > >     > for IEEE doubles into our SAS Viya Parquet integration.  SAS
    > >     > writes a NaN to represent floating-point doubles that are
    > >     > “missing,” i.e. NULL in more general data management terms.
    > >     >
    > >     > Of course SAS’ goal is to create .parquet files that are
    > >     > universally readable.  Therefore, it appears that the SAS Parquet
    > >     > writer(s) will NOT be able to write the usual NaN to represent
    > >     > “missing,” because doing so will cause a floating point exception
    > >     > for other readers.
    > >     >
    > >     > Based on the Parquet doc at
    > >     > https://parquet.apache.org/documentation/latest/ and by examining
    > >     > code, I understand that Parquet NULL values are indicated by
    > >     > setting 0 at the definition level vector offset corresponding to
    > >     > each NULL column value.
    > >     >
    > >     > Conversely, it appears that the per-column, per-page definition
    > >     > level data is never written when REQUIRED is specified for the
    > >     > column schema.
    > >     >
    > >     > Is my understanding and Parquet terminology correct here?
    > >     >
    > >     > Thanks,
    > >     >
    > >     > Brian
    > >     >
    > >
    > >
    > >
    
