I think that Parquet should exactly reproduce the data that is written to
files, rather than either allowing or requiring Parquet implementations to
normalize types. To me, that's a fundamental guarantee of the storage
layer. The compute layer can decide to normalize types and take actions to
make storage more efficient, but storage should not modify the data that is
passed to it.

In addition to being an important and basic guarantee of the format, I
think there are a few other good reasons for this. Normalizing in the
engine keeps the spec small while remaining flexible and expressive. For
example, the value 12.00 (decimal(4,2)) is equivalent to 12 (int8) in some
use cases, but not in others. If Parquet requires that 12.00 is always
equivalent to 12, then values can't be trusted for the cases that use
decimals for exact precision. Even if normalization is optional, you can't
trust that it wasn't normalized at write time. In addition, the spec would
need a lot more detail because Parquet would need to document rules for
normalization. For instance, when 12 is stored as an int16, should it be
normalized at read time to an int8? What about storing 12 as 12.00
(decimal(4,2))?
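
To make the precision point concrete, here's a rough sketch (toy code, not
Parquet's API; the Decimal struct and names are just for illustration) of
what is lost if a writer silently normalizes 12.00 (decimal(4,2)) down to a
plain integer:

#include <cstdint>
#include <iostream>

// Toy representation of a decimal value: unscaled digits plus a scale.
struct Decimal {
  int64_t unscaled;  // 1200
  int32_t scale;     // 2 => the value is 12.00, with two digits of precision
};

int main() {
  Decimal original{1200, 2};   // 12.00 as decimal(4,2)
  int8_t normalized = 12;      // the "equivalent" integer after normalization

  // The numeric values match...
  std::cout << (original.unscaled == int64_t{normalized} * 100) << "\n";  // 1

  // ...but the scale is gone, so a reader that needs exact decimal
  // semantics can no longer reconstruct decimal(4,2) from what was stored.
  Decimal roundTripped{normalized, 0};  // reads back as 12, not 12.00
  std::cout << (roundTripped.scale == original.scale) << "\n";            // 0
  return 0;
}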

I also don't think that we need normalization in Parquet to get good
shredding performance. I think we agree that engines can normalize values
before writing to Parquet, so we aren't really gaining much by allowing
Parquet to do it, too. Plus, engines have additional tools to address these
cases, like applying sort orders that cluster by type and value.
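
As a sketch of what I mean by normalizing in the engine (the FieldValue type
and helper below are hypothetical, not any engine's real API): the engine can
widen mixed integer widths to one type before the values ever reach the
Parquet writer, so storage never has to coerce anything.

#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// One shredded field's values as the engine sees them before writing.
using FieldValue = std::variant<int16_t, int32_t, int64_t, std::string>;

// Widen all integer widths to int64 so the shredded column has one type;
// non-integer values are left alone to fall back to the binary encoding.
std::vector<FieldValue> normalizeIntegers(const std::vector<FieldValue>& in) {
  std::vector<FieldValue> out;
  out.reserve(in.size());
  for (const auto& v : in) {
    if (const auto* p16 = std::get_if<int16_t>(&v)) {
      out.emplace_back(int64_t{*p16});
    } else if (const auto* p32 = std::get_if<int32_t>(&v)) {
      out.emplace_back(int64_t{*p32});
    } else {
      out.push_back(v);
    }
  }
  return out;
}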

But even if we were to allow Parquet to do this, we've already decided not
to add similar optimizations that preserve types on the basis that they are
not very useful. Earlier in our discussions, I suggested allowing multiple
shredded types for a given field name. For instance, shredding to columns
with different decimal scales. Other people pointed out that while this
would be useful in theory, data tends to be fairly uniformly typed in
practice and it wasn't worth the complexity.

I think for the other issue -- whether to use the most significant bits for
basic type -- this logic also applies. That is, the variation is unlikely
to be enough that this has a significant impact. The untyped values are
likely to be dictionary-encoded, which allows better decision-making. A
small tweak to Micah's example shows this: if the expression were
CAST(EXTRACT($.a, vcol) AS INT), then strings that contain numbers would be
coerced to non-null values. Being able to discard all strings doesn't help
in this case, but checking dictionary values would.
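
Roughly, the kind of check I have in mind looks like this (the decoded
VariantValue representation and the cast-to-null behavior are assumptions for
illustration, not anything in the spec):

#include <charconv>
#include <string>
#include <vector>

// Assumed decoded form of an un-shredded (binary encoded) variant value.
struct VariantValue {
  enum class Kind { Int64, Double, String, Other } kind;
  std::string str;  // populated when kind == Kind::String
};

// Does a string coerce to a non-null INT under the engine's cast rules?
// (Assumed here: numeric strings coerce, everything else yields null.)
bool stringCoercesToInt(const std::string& s) {
  int64_t parsed = 0;
  auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), parsed);
  return ec == std::errc() && ptr == s.data() + s.size();
}

// True if any dictionary entry could produce a non-null result for
// CAST(EXTRACT($.a, vcol) AS INT); if false, the page can be skipped.
bool dictionaryMayMatch(const std::vector<VariantValue>& dictionary) {
  for (const auto& v : dictionary) {
    switch (v.kind) {
      case VariantValue::Kind::Int64:
      case VariantValue::Kind::Double:
        return true;                                  // numeric: casts cleanly
      case VariantValue::Kind::String:
        if (stringCoercesToInt(v.str)) return true;   // "12" would become 12
        break;
      case VariantValue::Kind::Other:
        break;                                        // assumed to cast to null
    }
  }
  return false;
}

Knowing only that the leftover values are all strings would wrongly suggest
the page can be skipped; checking the handful of dictionary values gives the
right answer.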

If there were no other concerns, I would probably have opted to use the two
most significant bits for basic type. But this is an existing spec that was
moved from Spark and has been under discussion in a few communities. I've
already built an implementation for the Iceberg project and I'm concerned
that making significant changes like this would require bumping the spec
version and maintaining both code paths. To me, the optimization isn't
worth that complexity and cost. (This does support the suggestion that the
Parquet annotation should have an encoding spec version!)

Ryan

On Mon, Dec 9, 2024 at 12:52 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

> > I think this might be key to the discussion.  In my mind shredding is a
> > storage level concern, and perhaps a shared concern between engines and
> > storage when lower level optimizations are required.
>
> I agree -- perhaps the distinction is "where is the boundary between
> storage layer and query engine and what is the API" -- different systems
> might reasonably have different ideas.
>
> > This is why it is not necessarily unreasonable to let parquet
> implementations do normalization automatically.
>
> I see your point. I think my opinion remains that it would be simpler to
> implement (and thus more likely to get wide adoption) if we did not add the
> notion of data format coercion into the parquet spec, but I can see other
> opinions.
>
> I have been writing a lot on this topic; I think it is time to now bow out
> and let others comment if they want.
>
> Andrew
>
>
>
> On Mon, Dec 9, 2024 at 1:05 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > >
> > > The processing engine would either have to cast "a" to int64 before
> > > writing, shred it to another column with variant type, or not shred it.
> >
> >
> > I think this might be key to the discussion.  In my mind shredding is a
> > storage level concern, and perhaps a shared concern between engines and
> > storage when lower level optimizations are required.  IMO Parquet
> > implementations should be able to have a function signature like
> > `writeVariant(const std::string& metadata, const std::string& value)` and
> > store it in a reasonably efficient manner, assuming some level of common
> > schema between different variant objects.  As an analogy, most parquet
> > writers will attempt to dictionary encode data as an optimization.
> > Advanced writers will specifically pass through a dictionary for the
> values
> > to be encoded if they are already dictionary encoded in memory.
> >
> > 1. If a field is shredded to a new column, the writer simply writes
> values
> > > of that type (and errors if it sees a different type on some subsequent
> > > record) - no coercion / casting required
> >
> >
> > 2. If a field is shredded to a new column, it is guaranteed to have the
> > > same type and thus statistics should work well
> >
> >
> > I think in practice this is challenging, since one doesn't necessarily
> > know the physical types beforehand (thus leading to costly rewrites if
> > not everything could be shredded). It also misses out on room for
> > substantial optimizations at the storage level (suppose 19999 values are
> > int16 and 1 is of a different type).  The current proposal is closer to
> > this where the shredded type is specific to one type and any type that
> > doesn't fall into that category is binary encoded.  As discussed in the
> > thread already, this approach leaves the validity of statistics in
> > question depending on what types are binary encoded.
> >
> > At least with SQL /  structured databases, having fields with different
> > > types in different records can't be well represented in the data model,
> > so
> > > the reader would likely impose a common type anyways.
> >
> >
> > I think this describes the common case: readers typically only care
> > about a field that contains a number.  This is why it is not necessarily
> > unreasonable to let parquet implementations do normalization
> > automatically.  I agree mixing types in SQL is a little awkward, but as
> > an example, JSON is supported in the SQL standard, and it ends up being
> > an opaque object that can be operated on with accessor functions that
> > either error on unexpected types or return nulls (for example, see
> > Postgres's functions [1]).
> >
> > [1] https://www.postgresql.org/docs/current/functions-json.html
> >
> > Thanks,
> > Micah
> >
> >
> >
> >
> > On Sun, Dec 8, 2024 at 3:00 PM Andrew Lamb <andrewlam...@gmail.com>
> wrote:
> >
> > > I see -- thank you for the clarification
> > >
> > > >  For instance, if presented with a mix of physical int16 and int32
> > values
> > > for a field, it seems
> > > > at the storage level it is going to be more efficient to shred all
> > values
> > > > in an int32 column (this would be the physical type anyways), and add
> > > extra
> > > > metadata to reconstruct the original physical types if needed.
> > >
> > > In my opinion, the file format would be simpler to understand,
> implement,
> > > and reason about if it did not permit shredding fields that had
> different
> > > types into a primitive column.
> > >
> > > For example, in the records below, "a" has int32 type in the first record
> but
> > > int64 type in the second and I would suggest not allowing it to be
> > shredded
> > > into an int64 column.
> > > {"a": 1i32, "b": "abc"}
> > > {"a": 2i64, "b": "def"}
> > >
> > > The processing engine would either have to cast "a" to int64 before
> > > writing, shred it to another column with variant type, or not shred it.
> > >
> > > Not permitting shredding a field with multiple types seems like it
> would
> > > avoid the complexities you are describing:
> > > 1. If a field is shredded to a new column, the writer simply writes
> > values
> > > of that type (and errors if it sees a different type on some subsequent
> > > record) - no coercion / casting required
> > > 2. If a field is shredded to a new column, it is guaranteed to have the
> > > same type and thus statistics should work well
> > >
> > > At least with SQL /  structured databases, having fields with different
> > > types in different records can't be well represented in the data model,
> > so
> > > the reader would likely impose a common type anyways.
> > >
> > > Andrew
> > >
> > >
> > >
> > >
> > > On Sun, Dec 8, 2024 at 5:23 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > > > Hi Andrew,
> > > >
> > > > Thanks for the input.  I think these design choices mostly revolve
> > around
> > > > distributions of physical types we expect to see for specific fields.
> > > I've
> > > > added some more examples below.
> > > >
> > > >
> > > > > TLDR is that for the shredding question, I suggest the format not
> > > > > require/allow normalization and leave that as a feature for writer
> > > > > implementations if desired[1].
> > > >
> > > >
> > > > This makes sense to me as a reasonable path forwards.  It leaves the
> > > > question of whether more metadata is needed in the case that clients
> > > don't
> > > > want the parquet implementation to do normalization.  For instance,
> if
> > > > presented with a mix of physical int16 and int32 values for a field, it
> > > seems
> > > > at the storage level it is going to be more efficient to shred all
> > values
> > > > in an int32 column (this would be the physical type anyways), and add
> > > extra
> > > > metadata to reconstruct the original physical types if needed.
> > > >
> > > > In terms of statistics, I think trying to require the binary version
> to
> > > be
> > > > > useful for min/max calculations with existing statistics would be
> of
> > > very
> > > > > limited value (though I may misunderstand).
> > > >
> > > >
> > > > I agree with this without shredding.  In most cases I expect top
> level
> > > > variants to mostly be "Objects", so min/max probably doesn't add too
> > much
> > > > value.  When thinking about shredding the general expectation is that
> > > > fields are mostly of the same type and there will be a fair number of
> > > > shredded primitive types.  To better understand the bit ordering
> > > > implications we can look at a concrete, possibly contrived, example.
> > > >
> > > > Consider a field shredded into an int32  typed value.  The remaining
> > data
> > > > (encoded in binary form) has short strings of lengths between 0
> > (leading
> > > > byte 0b00000001) and  length 7 (leading byte 0b00011101).
> > > >
> > > > In the current binary encoding scheme, we cannot determine whether
> the
> > > > binary encoded values contain an int64 value (IIUC leading byte is
> > > > 0b00011000).  This implies that the statistics for the shredded int32
> > > > columns cannot be used for pruning, and for projecting an exact
> numeric
> > > > value the entire binary encoded column needs to be parsed.  If the
> > > > basic_type data was encoded in the most significant bits we would get
> > > > (leading byte 0b01000000) for an empty short string, (leading byte
> > > > 0b01000111) for length 7 short strings and (0b00000110) for int64.
> > This
> > > > would allow min/max bounds to be used to definitively determine if
> > there
> > > > were only short strings vs other primitive types.
> > > >
> > > > In my mind if we were starting from scratch the latter encoding is
> more
> > > > useful [1] and would probably be preferred even if we don't know
> > exactly
> > > > how often it might  be used (I might be missing some advantages of
> the
> > > > current scheme though). The main argument against it is the practical
> > > > concern with compatibility of the existing encoding (which from
> > > > Parquet's perspective is experimental).
> > > >
> > > > Cheers,
> > > > Micah
> > > >
> > > > [1] It also appears better for a nano-optimization; it would take one
> > > less
> > > > instruction to extract length values for short strings (only a mask,
> > > > compared to a mask + shift).  I wouldn't expect this to actually
> impact
> > > > performance though.
> > > >
> > > >
> > > >
> > > >
> > > > On Sun, Dec 8, 2024 at 6:23 AM Andrew Lamb <andrewlam...@gmail.com>
> > > wrote:
> > > >
> > > > > Thank you for this summary Micah -- I found it very helpful to
> have a
> > > > > summary rather than have to incrementally put it together based on
> > > > comments
> > > > > in the PR
> > > > >
> > > > > TLDR is that for the shredding question, I suggest the format not
> > > > > require/allow normalization and leave that as a feature for writer
> > > > > implementations if desired[1].
> > > > >
> > > > > >   In the long run this could include adding variant specific
> > > > > > statistics about which types are encoded, but in short run, this
> > > > exposes
> > > > > > the fact that the binary encoding of types appears to be
> suboptimal
> > > for
> > > > > use
> > > > > > in min/max statistics.
> > > > >
> > > > > In terms of statistics, I think trying to require the binary
> version
> > to
> > > > be
> > > > > useful for min/max calculations with existing statistics would be
> of
> > > very
> > > > > limited value (though I may misunderstand). It seems unlikely to me
> > > that
> > > > > there are many use cases where a single min/max value for a binary
> > > > > representation would be useful (though maybe it would make sense
> for
> > > > > separately stored columns)
> > > > >
> > > > > I think focusing on getting the variants into the spec and widely
> > > > supported
> > > > > is more important than optimizing the statistics. As you say adding
> > > > > specialized statistics for structured types is likely better in the
> > > long
> > > > > run.
> > > > >
> > > > > Andrew
> > > > >
> > > > > [1]:
> > > > >
> > >
> https://github.com/apache/parquet-format/pull/461#discussion_r1874839885
> > > > >
> > > > > On Fri, Dec 6, 2024 at 2:26 PM Micah Kornfield <
> > emkornfi...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Everyone,
> > > > > > I think we are making good progress on the shredding
> specification.
> > > > > >
> > > > > > There are  higher level concerns that might benefit from more
> > points
> > > of
> > > > > > view.  The relevant github discussions are linked below [2][3].
> > I've
> > > > also
> > > > > > provided a summary below (apologies for its length).
> > > > > >
> > > > > > For background, Variant is a semi-structured type akin to JSON
> but
> > > > with a
> > > > > > richer set of primitive types (e.g. timestamp).  The current spec
> > > also
> > > > > > defines "logical" types that have different physical
> > representations
> > > > [1]
> > > > > > (e.g. "Exact numeric" is composed of Decimal, int32, int64, etc;
> > and
> > > > > there
> > > > > > are two different physical representations of strings).
> > > > > >
> > > > > > Shredding refers to a storage optimization of extracting some
> > fields
> > > > and
> > > > > > storing them as individual columns.  e.g.
> > > > > >
> > > > > > With variants:
> > > > > > {"a": 1, "b": "abc"}
> > > > > > {"a": 1.2, "b": 1}
> > > > > >
> > > > > > The "a" field could be stored in a separate column of
> Decimal(p=2,
> > > > s=1),
> > > > > > which would accurately preserve the semantics of the value (and
> > then
> > > be
> > > > > > merged back with the "b" column which would be stored as the
> binary
> > > > > > encoding [1] separately).
> > > > > >
> > > > > > This scheme provides several optimizations for readers:
> > > > > > 1.  It allows for more efficient projection (e.g.
> > > > AS_NUMBER(EXTRACT($.a,
> > > > > > variant_col)) would not need to parse the binary encoding)
> > > > > > 2.  It potentially allows for using statistics for queries like
> > > > > > AS_NUMBER(EXTRACT($.a, variant_col)) > 2 could prune the row
> > > > > groups/pages.
> > > > > > 3.  It allows for potentially better compression of the values.
> > > > > >
> > > > > > Optimizations 1 and 2 require knowing that there is nothing
> stored
> > > in
> > > > > > Binary encoded form that could affect the results.  Either
> > everything
> > > > is
> > > > > > stored in the typed column or any value stored in binary encoded
> > form
> > > > > would
> > > > > > not affect the result (e.g. in the case above for field "a" if
> all
> > > > values
> > > > > > stored in binary encoded form are strings, that is if there is a
> > > third
> > > > > > value `{"a": "n/a" }` it would not affect the results since
> > AS_NUMBER
> > > > > would
> > > > > > return null for it).
> > > > > >
> > > > > > Optimizations 1, 2, and 3 are most effective when values are
> > > > > > "normalized" (e.g. int32 and int64 are both stored as int64,
> > > > > > allowing for more values to be stored as a normal column instead
> > > > > > of binary encoded form).
> > > > > >
> > > > > > Given this background, the open questions are:
> > > > > > 1.  Should parquet implementations have freedom to do
> normalization
> > > on
> > > > > > their own (i.e. consolidate different semantically equivalent
> > > physical
> > > > > > types into one shredded column and therefore potentially return
> > > > different
> > > > > > physical types to readers than were provided as input).  If
> > > > > implementations
> > > > > > can't do this optimization, the implications are either:
> > > > > >     a. Parquet implementations rely on higher level components to
> > try
> > > > to
> > > > > > normalize as much as possible
> > > > > >     b.  Parquet adds into the specification additional metadata
> at
> > > the
> > > > > > parquet level to still do normalization but reproduce the
> original
> > > > > physical
> > > > > > type.
> > > > > >     c.  We lose some level of optimization if semantically
> > equivalent
> > > > > types
> > > > > > are stored, since only one physical type can be shredded into its
> > own
> > > > > > column.
> > > > > >
> > > > > > 2.  Avoiding parsing binary encoded variant values is one of the
> > main
> > > > > goals
> > > > > > of shredding.  Therefore, it is useful to have as much metadata as
> > > > > > possible to determine whether binary encoded values are relevant
> > > > > > to a projection/filter
> > > > > > operation.  In the long run this could include adding variant
> > > specific
> > > > > > statistics about which types are encoded, but in short run, this
> > > > exposes
> > > > > > the fact that the binary encoding of types appears to be
> suboptimal
> > > for
> > > > > use
> > > > > > in min/max statistics.
> > > > > >
> > > > > > IIUC, the current encoding appears to encode type information in
> > the
> > > > > least
> > > > > > significant bits of the first byte.  This implies that some types
> > can
> > > > > > make min/max values effectively useless for determining which set
> > of
> > > > > types
> > > > > > are encoded (e.g. short strings can make it impossible to tell if
> > all
> > > > > > values belong to the same type).  A secondary concern is that
> type
> > > ids
> > > > > are
> > > > > > tightly packed now.  This means introduction of a new physical
> type
> > > > that
> > > > > > is semantically equivalent to an existing type (e.g. float16) could
> > > make
> > > > > > stats less useful.
> > > > > >
> > > > > > Fixing the binary encoding issues would likely require a
> different
> > > > > version
> > > > > > in the binary specification [4] to accommodate data written before
> > the
> > > > > spec
> > > > > > was donated to the parquet community (as well as the baggage of
> > > > backwards
> > > > > > compatibility).
> > > > > >
> > > > > > Cheers,
> > > > > > Micah
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > [1]
> > > > > >
> > > >
> > https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> > > > > > [2]
> > > > https://github.com/apache/parquet-format/pull/461/files#r1851373388
> > > > > > [3]
> > > > > >
> > > >
> > https://github.com/apache/parquet-format/pull/461#discussion_r1855433947
> > > > > > [4]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#metadata-encoding
> > > > > >
> > > > >
> > > >
> > >
> >
>
