> This seems reasonable. I'd therefore expect that allowing reference
> implementations to shred data by taking the schema of a field the first
> time it appears would be a reasonable heuristic?

That is what I would expect -- and I would expect an error if some
subsequent variant instance had a different data type.

This is the same behavior I observed when trying to save JSON data into a
Parquet struct column using PyArrow: if a subsequent record contains a
different schema from the first, a runtime error is thrown.
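
For example (a minimal PyArrow sketch, not from the thread; pa.array
infers the struct type from the first element, and the exact exception
class may vary by version):

    import pyarrow as pa

    # The first record establishes the struct schema: {"a": int64}.
    ok = pa.array([{"a": 1}, {"a": 2}])
    print(ok.type)  # struct<a: int64>

    # A later record whose field "a" has an incompatible type cannot be
    # converted to the inferred schema, so PyArrow raises at runtime.
    try:
        pa.array([{"a": 1}, {"a": "twelve"}])
    except (pa.ArrowInvalid, pa.ArrowTypeError) as exc:
        print("runtime error:", exc)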

Andrew

On Mon, Dec 16, 2024 at 12:07 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Ryan,
>
> > In addition to being an important and basic guarantee of the format, I
> > think there are a few other good reasons for this. Normalizing in the
> > engine keeps the spec small while remaining flexible and expressive.
> > For example, the value 12.00 (decimal(4,2)) is equivalent to 12 (int8)
> > for some use cases, but not in others. If Parquet requires that 12.00
> > is always equivalent to 12, then values can't be trusted for the cases
> > that use decimals for exact precision. Even if normalization is
> > optional, you can't trust that it wasn't normalized at write time. In
> > addition, the spec would need a lot more detail because Parquet would
> > need to document rules for normalization. For instance, when 12 is
> > stored as an int16, should it be normalized at read time to an int8?
> > What about storing 12 as 12.00 (decimal(4,2))?
>
>
> Could you clarify your concerns here? The specification appears to
> already do this, at least partially, via the "Type equivalence class"
> (formerly known as Logical type) [1] of "exact numeric". If we don't
> believe Parquet should be making this determination, maybe it should be
> removed from the spec? I'm OK with the consensus expressed here of no
> normalization and no extra metadata; these can always be added in a
> follow-up revision if we find the existing modelling needs to be
> improved.
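>
> To make the exactness point concrete (a minimal Python sketch; Decimal
> stands in here for a decimal(4,2) Variant value):
>
>     import decimal
>
>     a = decimal.Decimal("12.00")   # decimal(4,2)
>     b = decimal.Decimal("12")      # integer-valued
>     print(a == b)          # True  -- numerically equivalent
>     print(str(a), str(b))  # 12.00 12 -- the scale is preserved
>     # Normalizing 12.00 to 12 at write time would discard the scale,
>     # so readers could no longer rely on it for exact-precision uses.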
>
>
> But even if we were to allow Parquet to do this, we've already decided
> > not to add similar optimizations that preserve types, on the basis
> > that they are not very useful. Earlier in our discussions, I suggested
> > allowing multiple shredded types for a given field name: for instance,
> > shredding to columns with different decimal scales. Other people
> > pointed out that while this would be useful in theory, data tends to
> > be fairly uniformly typed in practice, and it wasn't worth the
> > complexity.
>
>
> This seems reasonable. I'd therefore expect that allowing reference
> implementations to shred data by taking the schema of a field the first
> time it appears would be a reasonable heuristic? More generally, it
> might be good to start discussing what API changes we expect are needed
> to support shredding in reference implementations.
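>
> As a rough sketch of that "first type wins" heuristic (hypothetical
> Python helper, not an API from any reference implementation):
>
>     def infer_shredding_types(records):
>         """Choose each field's shredded type from its first appearance;
>         raise if a later record disagrees."""
>         types = {}
>         for record in records:
>             for field, value in record.items():
>                 t = type(value)
>                 if field not in types:
>                     types[field] = t
>                 elif types[field] is not t:
>                     raise TypeError(
>                         f"field {field!r}: saw {t.__name__}, "
>                         f"expected {types[field].__name__}")
>         return types
>
>     # The first record fixes "a" as int; a later str value would raise.
>     print(infer_shredding_types([{"a": 1}, {"a": 2, "b": "x"}]))
>     # {'a': <class 'int'>, 'b': <class 'str'>}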
>
> Thanks,
> Micah
>
>
> [1] https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
>
>
>
> On Wed, Dec 11, 2024 at 9:18 AM Russell Spitzer
> <russell.spit...@gmail.com> wrote:
>
> > For normalization, I agree with Ryan. I was part of those other
> > discussions, and I think this is an engine concern and not a storage
> > one.
> >
> > I'm also ok with basically getting no value from min/max of non-shredded
> > fields.
> >
> > On Wed, Dec 11, 2024 at 4:35 AM Antoine Pitrou <anto...@python.org>
> > wrote:
> >
> > > On Mon, 9 Dec 2024 16:33:51 -0800, "rdb...@gmail.com"
> > > <rdb...@gmail.com> wrote:
> > > > I think that Parquet should exactly reproduce the data that is
> > > > written to files, rather than either allowing or requiring Parquet
> > > > implementations to normalize types. To me, that's a fundamental
> > > > guarantee of the storage layer. The compute layer can decide to
> > > > normalize types and take actions to make storage more efficient,
> > > > but storage should not modify the data that is passed to it.
> > >
> > > FWIW, I agree with this.
> > >
> > > Regards
> > >
> > > Antoine.
