Commenting from the Parquet C++ side: we expose two writer APIs

* High level, using Apache Arrow -- uses Arrow's bitmap-based
null/valid representation for null values; NaN stays NaN
* Low level, where you produce your own repetition/definition levels

So if you're using the low level API, and you have values like

[1, 2, 3, NULL = NaN, 5]

then you could represent this as

def_levels = [1, 1, 1, 0, 1]
rep_levels = nullptr
values = [1, 2, 3, 5]

If you don't use the definition level encoding of nulls then other
readers will presume the values to be non-null.
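For a flat OPTIONAL column like the example above (max def level 1), the translation from NaN-marked doubles into def levels plus a dense value buffer can be sketched like this -- the function name is mine, not a parquet-cpp API:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Turn NaN-marked doubles into def levels and a dense (nulls removed)
// value buffer, matching the [1, 2, 3, NULL, 5] example above.
// Illustrative sketch; not an actual parquet-cpp entry point.
void NansToDefLevels(const std::vector<double>& input,
                     std::vector<int16_t>* def_levels,
                     std::vector<double>* values) {
    for (double v : input) {
        if (std::isnan(v)) {
            def_levels->push_back(0);  // null: nothing goes into the value buffer
        } else {
            def_levels->push_back(1);  // present: value is stored densely
            values->push_back(v);
        }
    }
}
```

The resulting def_levels/values pair is what you'd hand to the low-level writer, with rep levels omitted since there's no nesting.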

On Mon, May 13, 2019 at 1:06 PM Tim Armstrong
<[email protected]> wrote:
>
> > I see that OPTIONAL or REPEATED must be specified as the Repetition type
> for columns where def level of 0 indicates NULL and 1 means not NULL.  The
> SchemaDescriptor::BuildTree method at
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
> shows how this causes max_def_level to increment.
> That seems right if your data doesn't have any complex types in it,
> max_def_level will always be 0 or 1 depending on whether the column is
> REQUIRED/OPTIONAL. One option, depending on your data model, is to always
> just mark the field as OPTIONAL and provide the def levels. If they're all
> 1 they will compress extremely well. Impala actually does this because
> most columns end up being potentially nullable in the Impala/Hive data model.
>
> > We are using standard Parquet APIs via C++/libparquet.so and therefore
> not doing our own Parquet file-format writer/reader.
> Ok, great! I'm not so familiar with the parquet-cpp APIs but I took a quick
> look and I guess it does expose the concept of rep/def levels.
>
> > NaNs representing missing values occur frequently in a myriad of SAS use
> cases.  Other data types may be NULL as well, so I'm wondering if using def
> level to indicate NULLs is safer (with consideration to other readers) and
> also consumes less memory/storage across the spectrum of Parquet-supported
> data types?
> If I were in your situation, this is what I'd probably do. We've seen a lot
> more inconsistency in the handling of NaN between readers.
>
> On Mon, May 13, 2019 at 10:49 AM Brian Bowman <[email protected]> wrote:
>
> > Tim,
> >
> > Thanks for your detailed reply and especially for pointing out the RLE
> > encoding for the def levels!
> >
> > Your comment:
> >
> >     <<- If the field is required, the max def level is 0, therefore all
> > values
> >        are 0, therefore the def levels can be "decoded" from nothing and
> > the def
> >        levels can be omitted for the page.>>
> >
> > I see that OPTIONAL or REPEATED must be specified as the Repetition type
> > for columns where def level of 0 indicates NULL and 1 means not NULL.  The
> > SchemaDescriptor::BuildTree method at
> > https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
> > shows how this causes max_def_level to increment.
> >
> > We are using standard Parquet APIs via C++/libparquet.so and therefore
> > not doing our own Parquet file-format writer/reader.
> >
> > NaNs representing missing values occur frequently in a myriad of SAS use
> > cases.  Other data types may be NULL as well, so I'm wondering if using def
> > level to indicate NULLs is safer (with consideration to other readers) and
> > also consumes less memory/storage across the spectrum of Parquet-supported
> > data types?
> >
> > Best,
> >
> > Brian
> >
> >
> > On 5/13/19, 1:03 PM, "Tim Armstrong" <[email protected]>
> > wrote:
> >
> >     Parquet float/double values can hold any IEEE floating point value -
> >
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413
> > .
> >     So there's no reason you can't write NaN to the files. If a reader
> > isn't
> >     handling NaN values correctly, that seems like an issue with that
> > reader,
> >     although I think you're correct in that you're more likely to hit
> > reader
> >     bugs with NaN than NULL. (I may be telling you something you already
> > know,
> >     but thought I'd start with that).
> >
> >     I don't think the Parquet format is opinionated about what NULL vs NaN
> >     means, although I'd assume that NULL means that the data simply wasn't
> >     present, and NaN means that it was the result of a floating point
> >     calculation that resulted in NaN.
> >
> >     The rep/definition level encoding is fairly complex because of the
> > handling
> >     of nested types and the various ways of encoding the sequence of
> > levels.
> >     The way I'd think about it is:
> >
> >        - If you don't have any complex/nested types, rep levels aren't
> > needed
> >        and the logical def levels degenerate into 1 = not null, 0 = null.
> >        - The RLE encoding has a bit-width implied by the max def level
> > value -
> >        if the max level is 1, 1 bit is needed per value. If it is 0, 0
> > bits are
> >        needed per value.
> >        - If the field is required, the max def level is 0, therefore all
> > values
> >        are 0, therefore the def levels can be "decoded" from nothing and
> > the def
> >        levels can be omitted for the page.
> >        - If the field is nullable, the bit width is 1, therefore each def
> > level
> >        is logically a bit. However, RLE encoding is applied to the
> > sequence of 1/0
> >        levels -
> >        https://github.com/apache/parquet-format/blob/master/Encodings.md
> >
> >     The last point is where I think your understanding might diverge from
> > the
> >     implementation - the encoded def levels are not simply a bit vector,
> > it's a
> >     more complex hybrid RLE/bit-packed encoding.
> >
> >     If you use one of the existing Parquet libraries it will handle all
> > this
> >     for you - it's a headache to get it all right from scratch.
> >     - Tim
> >
> >
> >     On Mon, May 13, 2019 at 8:43 AM Brian Bowman <[email protected]>
> > wrote:
> >
> >     > All,
> >     >
> >     > I’m working to integrate the historic usage of SAS missing values
> > for IEEE
> >     > doubles into our SAS Viya Parquet integration.  SAS writes a NAN to
> >     > represent floating-point doubles that are “missing,” i.e. NULL in
> > more
> >     > general data management terms.
> >     >
> >     > Of course SAS’ goal is to create .parquet files that are universally
> >     > readable.  Therefore, it appears that the SAS Parquet writer(s) will
> > NOT be
> >     > able to write the usual NAN to represent “missing,” because doing so
> > will
> >     > cause a floating point exception for other readers.
> >     >
> >     > Based on the Parquet doc at:
> >     > https://parquet.apache.org/documentation/latest/ and by examining
> > code, I
> >     > understand that Parquet NULL values are indicated by setting 0
> > at the
> >     > definition level vector offset corresponding to each NULL column
> > offset
> >     > value.
> >     >
> >     > Conversely, it appears that the per-column, per-page definition
> > level data
> >     > is never written when REQUIRED is specified for the column
> > schema.
> >     >
> >     > Is my understanding and Parquet terminology correct here?
> >     >
> >     > Thanks,
> >     >
> >     > Brian
> >     >
> >
> >
> >
