Hi Andrew,

Thanks for the input.  I think these design choices mostly revolve around
distributions of physical types we expect to see for specific fields.  I've
added some more examples below.


> TLDR is that for the shredding question, I suggest the format not
> require/allow normalization and leave that as a feature for writer
> implementations if desired[1].


This makes sense to me as a reasonable path forward.  It leaves open the
question of whether more metadata is needed in cases where clients don't
want the parquet implementation to do normalization.  For instance, if
presented with a mix of physical int16 and int32 values for a field, it seems
more efficient at the storage level to shred all values into a single int32
column (this would be the physical type anyway) and to add extra metadata
to reconstruct the original physical types if needed.
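
As a rough sketch of what that extra metadata could look like (purely
hypothetical, not something in the current proposal; the names and type codes
below are made up for illustration):

# Hypothetical sketch: shred mixed int16/int32 values into one int32 column,
# plus a small side column recording the original physical type so readers
# can reproduce exactly what the writer was given.

ORIG_INT16, ORIG_INT32 = 0, 1   # made-up type codes for illustration

def shred_exact_numerics(values):
    """values: list of ("int16" | "int32", int) pairs for one shredded field."""
    typed_column, original_types = [], []
    for physical_type, value in values:
        # Always store the value widened to int32.
        typed_column.append(value)
        original_types.append(
            ORIG_INT16 if physical_type == "int16" else ORIG_INT32)
    return typed_column, original_types

def reconstruct(typed_column, original_types):
    # On read, report the physical type the writer originally provided.
    return [("int16" if t == ORIG_INT16 else "int32", v)
            for t, v in zip(original_types, typed_column)]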

> In terms of statistics, I think trying to require the binary version to be
> useful for min/max calculations with existing statistics would be of very
> limited value (though I may misunderstand).


I agree with this for the unshredded case.  I expect top level variants to
mostly be "Objects", so min/max probably doesn't add much value there.  With
shredding, though, the general expectation is that a field's values are mostly
of the same type, so there will be a fair number of shredded primitive types.
To better understand the bit-ordering implications, we can look at a concrete,
possibly contrived, example.

Consider a field shredded into an int32 typed value.  The remaining data
(encoded in binary form) consists of short strings with lengths between 0
(leading byte 0b00000001) and 7 (leading byte 0b00011101).

In the current binary encoding scheme, min/max statistics alone cannot tell
us whether the binary encoded values also contain an int64 (IIUC its leading
byte, 0b00011000, falls between the short-string bounds).  This implies that
the statistics for the shredded int32 column cannot be used for pruning, and
that projecting an exact numeric value requires parsing the entire binary
encoded column.  If the basic_type were instead encoded in the most
significant bits, we would get a leading byte of 0b01000000 for an empty
short string, 0b01000111 for a length-7 short string, and 0b00000110 for
int64.  This would allow min/max bounds to definitively determine whether
the column contains only short strings vs other primitive types.
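
To make the byte arithmetic concrete, here is a small Python sketch of just
the leading byte under both layouts (the current layout is my reading of the
spec; the alternative is the hypothetical one described above, and real
min/max statistics of course compare whole binary values, though the leading
byte dominates the ordering):

# Current spec layout (as I read it): header = (value_header << 2) | basic_type
# Hypothetical alternative:           header = (basic_type << 6) | value_header

PRIMITIVE, SHORT_STRING = 0, 1   # basic_type values
INT64_TYPE_ID = 6                # primitive type id for int64

def current(basic_type, value_header):
    return (value_header << 2) | basic_type

def alternative(basic_type, value_header):
    return (basic_type << 6) | value_header

# A binary column holding only short strings of lengths 0..7:
cur_min, cur_max = current(SHORT_STRING, 0), current(SHORT_STRING, 7)  # 0x01, 0x1D
# int64's leading byte (0x18) falls inside that range, so min/max stats
# cannot rule out the presence of an int64.
assert cur_min <= current(PRIMITIVE, INT64_TYPE_ID) <= cur_max

alt_min, alt_max = alternative(SHORT_STRING, 0), alternative(SHORT_STRING, 7)  # 0x40, 0x47
# Under the alternative layout, int64's leading byte (0x06) falls outside
# the short-string range, so the bounds are conclusive.
assert not alt_min <= alternative(PRIMITIVE, INT64_TYPE_ID) <= alt_max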

In my mind, if we were starting from scratch, the latter encoding is more
useful [1] and would probably be preferred even if we don't know exactly
how often it would be used (I might be missing some advantages of the
current scheme though).  The main argument against it is the practical
concern of compatibility with the existing encoding (which, from Parquet's
perspective, is experimental).

Cheers,
Micah

[1] It also appears better for a nano-optimization; it would take one less
instruction to extract length values for short strings (only a mask,
compared to a mask + shift).  I wouldn't expect this to actually impact
performance though.
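
A minimal sketch of that difference, assuming the header byte is already
loaded into an integer:

cur_header = 0b00011101   # length-7 short string, current layout
alt_header = 0b01000111   # length-7 short string, alternative layout

length_cur = (cur_header >> 2) & 0x3F   # shift + mask -> 7
length_alt = alt_header & 0x3F          # mask only    -> 7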





On Sun, Dec 8, 2024 at 6:23 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> Thank you for this summary Micah -- I found it very helpful to have a
> summary rather than have to incrementally put it together based on comments
> in the PR
>
> TLDR is that for the shredding question, I suggest the format not
> require/allow normalization and leave that as a feature for writer
> implementations if desired[1].
>
> >   In the long run this could include adding variant specific
> > statistics about which types are encoded, but in the short run, this exposes
> > the fact that the binary encoding of types appears to be suboptimal for use
> > in min/max statistics.
>
> In terms of statistics, I think trying to require the binary version to be
> useful for min/max calculations with existing statistics would be of very
> limited value (though I may misunderstand). It seems unlikely to me that
> there are many use cases where a single min/max value for a binary
> representation would be useful (though maybe it would make sense for
> separately stored columns)
>
> I think focusing on getting the variants into the spec and widely supported
> is more important than optimizing the statistics. As you say adding
> specialized statistics for structured types is likely better in the long
> run.
>
> Andrew
>
> [1]:
> https://github.com/apache/parquet-format/pull/461#discussion_r1874839885
>
> On Fri, Dec 6, 2024 at 2:26 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Everyone,
> > I think we are making good progress on the shredding specification.
> >
> > There are higher level concerns that might benefit from more points of
> > view.  The relevant github discussions are linked below [2][3]. I've also
> > provided a summary below (apologies for its length).
> >
> > For background, Variant is a semi-structured type akin to JSON but with a
> > richer set of primitive types (e.g. timestamp).  The current spec also
> > defines "logical" types that have different physical representations [1]
> > (e.g. "Exact numeric" is composed of Decimal, int32, int64, etc; and
> there
> > are two different physical representations of strings).
> >
> > Shredding refers to a storage optimization of extracting some fields and
> > storing them as individual columns.  e.g.
> >
> > With variants:
> > {"a": 1, "b": "abc"}
> > {"a": 1.2, "b": 1}
> >
> > The "a" field could be stored in a separate column of Decimal(p=2, s=1),
> > which would accurately preserve the semantics of the value (and then be
> > merged back with the "b" column which would be stored as the binary
> > encoding [1] separately).
> >
> > This scheme provides several optimizations for readers:
> > 1.  It allows for more efficient projection (e.g. AS_NUMBER(EXTRACT($.a,
> > variant_col)) would not need to parse the binary encoding)
> > 2.  It potentially allows statistics to prune row groups/pages for
> > queries like AS_NUMBER(EXTRACT($.a, variant_col)) > 2.
> > 3.  It allows for potentially better compression of the values.
> >
> > Optimizations 1 and 2 require knowing that there is nothing stored in
> > binary encoded form that could affect the results.  Either everything is
> > stored in the typed column, or any value stored in binary encoded form would
> > not affect the result (e.g. in the case above, if all values of field "a"
> > stored in binary encoded form are strings, then a third value
> > `{"a": "n/a" }` would not affect the results, since AS_NUMBER would
> > return null for it).
> >
> > Optimizations 1, 2, and 3 are most effective when values are "normalized"
> > (e.g. int32 and int64 are both stored as int64, allowing more values to
> > be stored in the typed column instead of in binary encoded form).
> >
> > Given this background, the open questions are:
> > 1.  Should parquet implementations have freedom to do normalization on
> > their own (i.e. consolidate different semantically equivalent physical
> > types into one shredded column and therefore potentially return different
> > physical types to readers than were provided as input)?  If implementations
> > can't do this optimization, the implications are one of the following:
> >     a.  Parquet implementations rely on higher level components to try to
> > normalize as much as possible.
> >     b.  Parquet adds additional metadata into the specification at the
> > parquet level to still do normalization but reproduce the original physical
> > type.
> >     c.  We lose some level of optimization if semantically equivalent types
> > are stored, since only one physical type can be shredded into its own
> > column.
> >
> > 2.  Avoiding parsing binary encoded variant values is one of the main goals
> > of shredding.  Therefore, it is useful to have as much metadata as possible
> > to determine whether binary encoded values are relevant to a
> > projection/filter operation.  In the long run this could include adding
> > variant specific statistics about which types are encoded, but in the short
> > run, this exposes the fact that the binary encoding of types appears to be
> > suboptimal for use in min/max statistics.
> >
> > IIUC, the current encoding appears to encode type information in the least
> > significant bits of the first byte.  This implies that some types can
> > make min/max values effectively useless for determining which set of types
> > are encoded (e.g. short strings can make it impossible to tell if all
> > values belong to the same type).  A secondary concern is that type ids are
> > tightly packed now.  This means the introduction of a new physical type that
> > is semantically equivalent to an existing type (e.g. float16) could make
> > stats less useful.
> >
> > Fixing the binary encoding issues would likely require a different version
> > in the binary specification [4] to accommodate data written before the spec
> > was donated to the parquet community (as well as the baggage of backwards
> > compatibility).
> >
> > Cheers,
> > Micah
> >
> >
> >
> >
> >
> > [1] https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> > [2] https://github.com/apache/parquet-format/pull/461/files#r1851373388
> > [3] https://github.com/apache/parquet-format/pull/461#discussion_r1855433947
> > [4] https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#metadata-encoding
> >
>
