Thanks Wes,

We are using the Parquet C++ low-level APIs.
Our Parquet "adapter" code will translate the SAS "missing" NaN representation
into the correct positions in the int16_t def level vector passed to the
Parquet low-level writer. Similarly, this adapter will reconstitute the NaN
"missing" representation from the def level vector returned by LevelDecoder()
at
https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_reader.cc#L77
up through ReadBatch() and ultimately back to SAS.

-Brian

On 5/13/19, 2:48 PM, "Wes McKinney" <[email protected]> wrote:

To comment from the Parquet C++ side, we expose two writer APIs:

* High level, using Apache Arrow -- uses Arrow's bitmap-based null/valid
  representation for null values; NaN is NaN
* Low level -- you produce your own repetition/definition levels

So if you're using the low-level API and you have values like
[1, 2, 3, NULL = NaN, 5], then you could represent this as:

def_levels = [1, 1, 1, 0, 1]
rep_levels = nullptr
values = [1, 2, 3, 5]

If you don't use the definition level encoding of nulls, then other readers
will presume the values to be non-null.

On Mon, May 13, 2019 at 1:06 PM Tim Armstrong <[email protected]> wrote:
>
> > I see that OPTIONAL or REPEATED must be specified as the Repetition type
> > for columns where a def level of 0 indicates NULL and 1 means not NULL.
> > The SchemaDescriptor::BuildTree method at
> > https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
> > shows how this causes max_def_level to increment.
>
> That seems right. If your data doesn't have any complex types in it,
> max_def_level will always be 0 or 1 depending on whether the column is
> REQUIRED/OPTIONAL. One option, depending on your data model, is to always
> just mark the field as OPTIONAL and provide the def levels. If they're all
> 1 they will compress extremely well. Impala actually does this, because
> most columns end up being potentially nullable in the Impala/Hive data
> model.
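[Editor's note] The NaN-to-def-level translation described above can be
illustrated with a small, self-contained sketch. The struct and function
names below are invented for illustration, and the actual write call (e.g.
handing the two vectors to a TypedColumnWriter's WriteBatch) is not shown:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch (not SAS's actual adapter code): split a column of
// doubles, where NaN marks "missing", into the def level vector and the
// densely packed value vector the low-level writer expects for a flat
// OPTIONAL column (max def level 1: 1 = present, 0 = null).
struct DefLevelColumn {
  std::vector<int16_t> def_levels;  // one entry per logical row
  std::vector<double> values;       // non-null values only
};

DefLevelColumn ToDefLevels(const std::vector<double>& column) {
  DefLevelColumn out;
  out.def_levels.reserve(column.size());
  for (double v : column) {
    if (std::isnan(v)) {
      out.def_levels.push_back(0);  // null row: no value is written
    } else {
      out.def_levels.push_back(1);  // present row
      out.values.push_back(v);
    }
  }
  return out;
}
```

For Wes's example column [1, 2, 3, NaN, 5], this yields def_levels
[1, 1, 1, 0, 1] and values [1, 2, 3, 5].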
> > We are using standard Parquet API's via C++/libparquet.so and therefore
> > not doing our own Parquet file-format writer/reader.
>
> Ok, great! I'm not so familiar with the parquet-cpp APIs, but I took a quick
> look and I guess it does expose the concept of rep/def levels.
>
> > NaNs representing missing values occur frequently in a myriad of SAS use
> > cases. Other data types may be NULL as well, so I'm wondering if using def
> > levels to indicate NULLs is safer (with consideration to other readers)
> > and also consumes less memory/storage across the spectrum of
> > Parquet-supported data types?
>
> If I were in your situation, this is what I'd probably do. We've seen a lot
> more inconsistency in the handling of NaN between readers.
>
> On Mon, May 13, 2019 at 10:49 AM Brian Bowman <[email protected]> wrote:
>
> > Tim,
> >
> > Thanks for your detailed reply, and especially for pointing out the RLE
> > encoding of the def levels!
> >
> > Your comment:
> >
> > <<- If the field is required, the max def level is 0, therefore all
> > values are 0, therefore the def levels can be "decoded" from nothing and
> > the def levels can be omitted for the page.>>
> >
> > I see that OPTIONAL or REPEATED must be specified as the Repetition type
> > for columns where a def level of 0 indicates NULL and 1 means not NULL.
> > The SchemaDescriptor::BuildTree method at
> > https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
> > shows how this causes max_def_level to increment.
> >
> > We are using standard Parquet API's via C++/libparquet.so and therefore
> > not doing our own Parquet file-format writer/reader.
> >
> > NaNs representing missing values occur frequently in a myriad of SAS use
> > cases. Other data types may be NULL as well, so I'm wondering if using
> > def levels to indicate NULLs is safer (with consideration to other
> > readers) and also consumes less memory/storage across the spectrum of
> > Parquet-supported data types?
> >
> > Best,
> >
> > Brian
> >
> >
> > On 5/13/19, 1:03 PM, "Tim Armstrong" <[email protected]> wrote:
> >
> > Parquet float/double values can hold any IEEE floating point value -
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413
> > So there's no reason you can't write NaN to the files. If a reader isn't
> > handling NaN values correctly, that seems like an issue with that reader,
> > although I think you're correct in that you're more likely to hit reader
> > bugs with NaN than with NULL. (I may be telling you something you already
> > know, but thought I'd start with that.)
> >
> > I don't think the Parquet format is opinionated about what NULL vs. NaN
> > means, although I'd assume that NULL means the data simply wasn't
> > present, and NaN means it was the result of a floating point calculation
> > that resulted in NaN.
> >
> > The rep/definition level encoding is fairly complex because of the
> > handling of nested types and the various ways of encoding the sequence
> > of levels. The way I'd think about it is:
> >
> > - If you don't have any complex/nested types, rep levels aren't needed
> >   and the logical def levels degenerate into 1 = not null, 0 = null.
> > - The RLE encoding has a bit width implied by the max def level value -
> >   if the max level is 1, 1 bit is needed per value. If it is 0, 0 bits
> >   are needed per value.
> > - If the field is required, the max def level is 0, therefore all values
> >   are 0, therefore the def levels can be "decoded" from nothing and the
> >   def levels can be omitted for the page.
> > - If the field is nullable, the bit width is 1, therefore each def level
> >   is logically a bit. However, RLE encoding is applied to the sequence
> >   of 1/0 levels -
> >   https://github.com/apache/parquet-format/blob/master/Encodings.md
> >
> > The last point is where I think your understanding might diverge from
> > the implementation - the encoded def levels are not simply a bit vector;
> > it's a more complex hybrid RLE/bit-packed encoding.
> >
> > If you use one of the existing Parquet libraries, it will handle all
> > this for you - it's a headache to get it all right from scratch.
> >
> > - Tim
> >
> > On Mon, May 13, 2019 at 8:43 AM Brian Bowman <[email protected]> wrote:
> >
> > > All,
> > >
> > > I’m working to integrate the historic usage of SAS missing values for
> > > IEEE doubles into our SAS Viya Parquet integration. SAS writes a NaN
> > > to represent floating-point doubles that are “missing,” i.e. NULL in
> > > more general data management terms.
> > >
> > > Of course SAS’ goal is to create .parquet files that are universally
> > > readable. Therefore, it appears that the SAS Parquet writer(s) will
> > > NOT be able to write the usual NaN to represent “missing,” because
> > > doing so would cause a floating point exception for other readers.
> > >
> > > Based on the Parquet doc at
> > > https://parquet.apache.org/documentation/latest/ and by examining
> > > code, I understand that Parquet NULL values are indicated by setting 0
> > > at the definition level vector offset corresponding to each NULL
> > > column value.
> > >
> > > Conversely, it appears that the per-column, per-page definition level
> > > data is never written when REQUIRED is specified for the column
> > > schema.
> > >
> > > Is my understanding and Parquet terminology correct here?
> > >
> > > Thanks,
> > >
> > > Brian
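[Editor's note] The read-side reconstitution Brian describes at the top of
the thread (rebuilding the SAS NaN "missing" representation from the def
level vector returned by the reader) can be sketched as follows. The
function name is invented for this sketch, and it is a simplification for a
flat OPTIONAL column that ignores rep levels and nested types:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sketch: given the def levels and densely packed values a
// low-level read returns for a flat OPTIONAL double column (max def level
// 1), re-expand to one double per row, substituting a quiet NaN for each
// null row.
std::vector<double> FromDefLevels(const std::vector<int16_t>& def_levels,
                                  const std::vector<double>& values) {
  std::vector<double> out;
  out.reserve(def_levels.size());
  std::size_t next = 0;  // index of the next densely packed value
  for (int16_t level : def_levels) {
    if (level == 1) {
      out.push_back(values[next++]);  // value is present
    } else {
      out.push_back(std::numeric_limits<double>::quiet_NaN());  // "missing"
    }
  }
  return out;
}
```

Applied to def_levels [1, 1, 0, 1] and values [1, 2, 5], this reproduces the
logical column [1, 2, NaN, 5].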
