Whether to fail is up to the writer. The spec only defines how data can be shredded and how shredded data must be interpreted. A writer that wants to fail when data doesn't fit its shredding schema is free to do so, but I don't think that will be very common; the idea is to be very flexible.
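For concreteness, here is a minimal sketch of that fall-back behavior, assuming a shredded field is materialized as a typed_value/value pair as in the shredding proposal. The helper names (shred_field, encode_variant) are hypothetical rather than from any reference implementation, and JSON stands in for the real Variant binary encoding just to keep the example runnable:

    import json
    from typing import Any, Optional, Tuple

    def encode_variant(value: Any) -> bytes:
        # Stand-in for the real Variant binary encoding; JSON keeps the
        # sketch self-contained and runnable.
        return json.dumps(value).encode("utf-8")

    def shred_field(value: Any, shredded_type: type) -> Tuple[Optional[Any], Optional[bytes]]:
        """Return (typed_value, value) cells for one row of a shredded field."""
        if isinstance(value, shredded_type):
            # The value matches the shredded type, so it goes into the
            # typed_value column and benefits from stats and pruning.
            return value, None
        # The value doesn't fit the shredding schema. Instead of failing,
        # keep it as an encoded variant in the untyped value column; the
        # only cost is the lost performance benefit for this row.
        return None, encode_variant(value)

    # A field shredded as int, with one row carrying a string instead:
    print(shred_field(34, int))    # (34, None)
    print(shred_field("34", int))  # (None, b'"34"')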
On Fri, Dec 20, 2024 at 12:23 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> > That is what I would expect -- and I would expect an error if some
> > subsequent variant instance had a different data type.
>
> IIUC, and I mostly agree with, the shredding spec as currently proposed
> does not fail in this case it just loses performance benefits. I think
> this is a reasonable compromise, given we expect mostly identical schemas
> but sometimes type conflicts can't be helped and want to be resilient to
> them.
>
> On Mon, Dec 16, 2024 at 3:10 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
> > > This seems reasonable. I'd therefore expect that allowing reference
> > > implementations to shred data by taking the schema of a field the first
> > > time it appears as a reasonable heuristic?
> >
> > That is what I would expect -- and I would expect an error if some
> > subsequent variant instance had a different data type.
> >
> > This is the same behavior I observed when trying to save json data into a
> > parquet struct column using pyarrow. If some subsequent record contains a
> > different schema than the first, a runtime error is thrown.
> >
> > Andrew
> >
> > On Mon, Dec 16, 2024 at 12:07 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > Hi Ryan,
> > >
> > > > In addition to being an important and basic guarantee of the format, I
> > > > think there are a few other good reasons for this. Normalizing in the
> > > > engine keeps the spec small while remaining flexible and expressive.
> > > > For example, the value 12.00 (decimal(4,2)) is equivalent to the 12
> > > > (int8) for some use cases, but not in others. If Parquet requires that
> > > > 12.00 is always equivalent to 12, then values can't be trusted for the
> > > > cases that use decimals for exact precision. Even if normalization is
> > > > optional, you can't trust that it wasn't normalized at write time. In
> > > > addition, the spec would need a lot more detail because Parquet would
> > > > need to document rules for normalization. For instance, when 12 is
> > > > stored as an int16, should it be normalized at read time to an int8?
> > > > What about storing 12 as 12.00 (decimal(4,2))?
> > >
> > > Could you clarify your concerns here, the specification appears to
> > > already at least partially do exactly this via "Type equivalence class"
> > > (formally known as Logical type) [1] of "exact numeric". If we don't
> > > want to believe parquet should be making this determination maybe it
> > > should be removed from the spec? I'm OK with the consensus expressed
> > > here with no normalization and no extra metadata. These can always be
> > > added in a follow-up revision if we find the existing modelling needs
> > > to be improved.
> > >
> > > > But even if we were to allow Parquet to do this, we've already decided
> > > > not to add similar optimizations that preserve types on the basis that
> > > > they are not very useful. Earlier in our discussions, I suggested
> > > > allowing multiple shredded types for a given field name. For instance,
> > > > shredding to columns with different decimal scales. Other people
> > > > pointed out that while this would be useful in theory, data tends to
> > > > be fairly uniformly typed in practice and it wasn't worth the
> > > > complexity.
> > >
> > > This seems reasonable. I'd therefore expect that allowing reference
> > > implementations to shred data by taking the schema of a field the first
> > > time it appears as a reasonable heuristic? More generally it might be
> > > good to start discussing what API changes we expect are needed to
> > > support shredding in reference implementations?
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > > https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
> > >
> > > On Wed, Dec 11, 2024 at 9:18 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
> > >
> > > > For normalization I agree with Ryan. I was part of those other
> > > > discussions and I think it does seem like this is an engine concern
> > > > and not a storage one.
> > > >
> > > > I'm also ok with basically getting no value from min/max of
> > > > non-shredded fields.
> > > >
> > > > On Wed, Dec 11, 2024 at 4:35 AM Antoine Pitrou <anto...@python.org> wrote:
> > > >
> > > > > On Mon, 9 Dec 2024 16:33:51 -0800 "rdb...@gmail.com" <rdb...@gmail.com> wrote:
> > > > > > I think that Parquet should exactly reproduce the data that is
> > > > > > written to files, rather than either allowing or requiring Parquet
> > > > > > implementations to normalize types. To me, that's a fundamental
> > > > > > guarantee of the storage layer. The compute layer can decide to
> > > > > > normalize types and take actions to make storage more efficient,
> > > > > > but storage should not modify the data that is passed to it.
> > > > >
> > > > > FWIW, I agree with this.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
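Two small illustrations, for anyone following along. First, on the normalization point: 12.00 and 12 compare equal as exact numerics while still carrying different information, which is why readers can't trust the scale if storage is allowed to normalize. Python's decimal module is used here only as an analogy for Parquet's decimal logical type:

    from decimal import Decimal

    stored = Decimal("12.00")    # what was written: decimal(4,2), scale 2
    rewritten = Decimal("12")    # what a normalizing writer might keep

    print(stored == rewritten)            # True: equal as exact numerics
    print(stored.as_tuple().exponent)     # -2: the original scale
    print(rewritten.as_tuple().exponent)  # 0: the scale information is gone

Second, a minimal pyarrow sketch of the behavior Andrew describes: the Parquet schema is fixed from the first batch, and a later batch whose inferred struct type differs is rejected at write time. The file path and field names are made up for the example:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # First batch: "payload" is inferred as struct<id: int64>.
    first = pa.Table.from_pylist([{"payload": {"id": 1}}])
    writer = pq.ParquetWriter("/tmp/example.parquet", first.schema)
    writer.write_table(first)

    # Second batch: the same field now carries a string, so it is inferred
    # as struct<id: string>, which no longer matches the writer's schema.
    second = pa.Table.from_pylist([{"payload": {"id": "abc"}}])
    try:
        writer.write_table(second)
    except (ValueError, pa.ArrowInvalid) as exc:
        print(f"write failed: {exc}")
    finally:
        writer.close()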