Whether to fail is up to the writer. The spec only defines how data can be shredded and how shredded data must be interpreted. A writer that wants to fail when data doesn't fit its shredding schema is free to do so, but I don't think that will be very common; the idea is to be very flexible.
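For concreteness, here is a minimal sketch of that fall-back behavior, assuming a shredded field is materialized as a typed_value/value pair as in the shredding proposal. The helper names (shred_field, encode_variant) are hypothetical rather than from any reference implementation, and JSON stands in for the real Variant binary encoding just to keep the example runnable:

    import json
    from typing import Any, Optional, Tuple

    def encode_variant(value: Any) -> bytes:
        # Stand-in for the real Variant binary encoding; JSON keeps the
        # sketch self-contained and runnable.
        return json.dumps(value).encode("utf-8")

    def shred_field(value: Any, shredded_type: type) -> Tuple[Optional[Any], Optional[bytes]]:
        """Return (typed_value, value) cells for one row of a shredded field."""
        if isinstance(value, shredded_type):
            # The value matches the shredded type, so it goes into the
            # typed_value column and benefits from stats and pruning.
            return value, None
        # The value doesn't fit the shredding schema. Instead of failing,
        # keep it as an encoded variant in the untyped value column; the
        # only cost is the lost performance benefit for this row.
        return None, encode_variant(value)

    # A field shredded as int, with one row carrying a string instead:
    print(shred_field(34, int))    # (34, None)
    print(shred_field("34", int))  # (None, b'"34"')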
On Fri, Dec 20, 2024 at 12:23 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> > That is what I would expect -- and I would expect an error if some
> > subsequent variant instance had a different data type.
>
> IIUC, and I mostly agree with, the shredding spec as currently proposed
> does not fail in this case it just loses performance benefits. I think
> this is a reasonable compromise, given we expect mostly identical schemas
> but sometimes type conflicts can't be helped and want to be resilient to
> them.
>
> On Mon, Dec 16, 2024 at 3:10 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
> > > This seems reasonable. I'd therefore expect that allowing reference
> > > implementations to shred data by taking the schema of a field the first
> > > time it appears as a reasonable heuristic?
> >
> > That is what I would expect -- and I would expect an error if some
> > subsequent variant instance had a different data type.
> >
> > This is the same behavior I observed when trying to save json data into a
> > parquet struct column using pyarrow. If some subsequent record contains a
> > different schema than the first, a runtime error is thrown.
> >
> > Andrew
> >
> > On Mon, Dec 16, 2024 at 12:07 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > Hi Ryan,
> > >
> > > > In addition to being an important and basic guarantee of the format, I
> > > > think there are a few other good reasons for this. Normalizing in the
> > > > engine keeps the spec small while remaining flexible and expressive.
> > > > For example, the value 12.00 (decimal(4,2)) is equivalent to the 12
> > > > (int8) for some use cases, but not in others. If Parquet requires that
> > > > 12.00 is always equivalent to 12, then values can't be trusted for the
> > > > cases that use decimals for exact precision. Even if normalization is
> > > > optional, you can't trust that it wasn't normalized at write time. In
> > > > addition, the spec would need a lot more detail because Parquet would
> > > > need to document rules for normalization. For instance, when 12 is
> > > > stored as an int16, should it be normalized at read time to an int8?
> > > > What about storing 12 as 12.00 (decimal(4,2))?
> > >
> > > Could you clarify your concerns here, the specification appears to
> > > already at least partially do exactly this via "Type equivalence class"
> > > (formally known as Logical type) [1] of "exact numeric". If we don't
> > > want to believe parquet should be making this determination maybe it
> > > should be removed from the spec? I'm OK with the consensus expressed
> > > here with no normalization and no extra metadata. These can always be
> > > added in a follow-up revision if we find the existing modelling needs
> > > to be improved.
> > >
> > > > But even if we were to allow Parquet to do this, we've already decided
> > > > not to add similar optimizations that preserve types on the basis that
> > > > they are not very useful. Earlier in our discussions, I suggested
> > > > allowing multiple shredded types for a given field name. For instance,
> > > > shredding to columns with different decimal scales. Other people
> > > > pointed out that while this would be useful in theory, data tends to
> > > > be fairly uniformly typed in practice and it wasn't worth the
> > > > complexity.
> > >
> > > This seems reasonable. I'd therefore expect that allowing reference
> > > implementations to shred data by taking the schema of a field the first
> > > time it appears as a reasonable heuristic? More generally it might be
> > > good to start discussing what API changes we expect are needed to
> > > support shredding in reference implementations?
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > > https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
> > >
> > > On Wed, Dec 11, 2024 at 9:18 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
> > >
> > > > For normalization I agree with Ryan. I was part of those other
> > > > discussions and I think it does seem like this is an engine concern
> > > > and not a storage one.
> > > >
> > > > I'm also ok with basically getting no value from min/max of
> > > > non-shredded fields.
> > > >
> > > > On Wed, Dec 11, 2024 at 4:35 AM Antoine Pitrou <anto...@python.org> wrote:
> > > >
> > > > > On Mon, 9 Dec 2024 16:33:51 -0800 "rdb...@gmail.com" <rdb...@gmail.com> wrote:
> > > > > > I think that Parquet should exactly reproduce the data that is
> > > > > > written to files, rather than either allowing or requiring Parquet
> > > > > > implementations to normalize types. To me, that's a fundamental
> > > > > > guarantee of the storage layer. The compute layer can decide to
> > > > > > normalize types and take actions to make storage more efficient,
> > > > > > but storage should not modify the data that is passed to it.
> > > > >
> > > > > FWIW, I agree with this.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
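Two small illustrations, for anyone following along. First, on the normalization point: 12.00 and 12 compare equal as exact numerics while still carrying different information, which is why readers can't trust the scale if storage is allowed to normalize. Python's decimal module is used here only as an analogy for Parquet's decimal logical type:

    from decimal import Decimal

    stored = Decimal("12.00")    # what was written: decimal(4,2), scale 2
    rewritten = Decimal("12")    # what a normalizing writer might keep

    print(stored == rewritten)            # True: equal as exact numerics
    print(stored.as_tuple().exponent)     # -2: the original scale
    print(rewritten.as_tuple().exponent)  # 0: the scale information is gone

Second, a minimal pyarrow sketch of the behavior Andrew describes: the Parquet schema is fixed from the first batch, and a later batch whose inferred struct type differs is rejected at write time. The file path and field names are made up for the example:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # First batch: "payload" is inferred as struct<id: int64>.
    first = pa.Table.from_pylist([{"payload": {"id": 1}}])
    writer = pq.ParquetWriter("/tmp/example.parquet", first.schema)
    writer.write_table(first)

    # Second batch: the same field now carries a string, so it is inferred
    # as struct<id: string>, which no longer matches the writer's schema.
    second = pa.Table.from_pylist([{"payload": {"id": "abc"}}])
    try:
        writer.write_table(second)
    except (ValueError, pa.ArrowInvalid) as exc:
        print(f"write failed: {exc}")
    finally:
        writer.close()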