> This seems reasonable. I'd therefore expect that allowing reference
> implementations to shred data by taking the schema of a field the first
> time it appears would be a reasonable heuristic?
That is what I would expect -- and I would expect an error if some
subsequent variant instance had a different data type. This is the same
behavior I observed when trying to save JSON data into a Parquet struct
column using pyarrow: if a subsequent record contains a different schema
than the first, a runtime error is thrown.
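A minimal sketch of that pyarrow behavior: type inference fixes each
struct field's type from the first record, so the conflict surfaces
before the data ever reaches a Parquet writer (the exact error message
varies by pyarrow version; a sketch of the first-occurrence heuristic
itself follows at the end of this message):

    import pyarrow as pa

    # The first record fixes the inferred type of field "a" as int64;
    # the conflicting string in the second record fails to convert and
    # raises ArrowInvalid.
    records = [{"a": 1}, {"a": "not an int"}]

    try:
        pa.array(records)
    except pa.ArrowInvalid as err:
        print(f"schema conflict: {err}")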
Andrew

On Mon, Dec 16, 2024 at 12:07 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Ryan,
>
> > In addition to being an important and basic guarantee of the format, I
> > think there are a few other good reasons for this. Normalizing in the
> > engine keeps the spec small while remaining flexible and expressive.
> > For example, the value 12.00 (decimal(4,2)) is equivalent to 12 (int8)
> > for some use cases, but not in others. If Parquet requires that 12.00
> > is always equivalent to 12, then values can't be trusted for the cases
> > that use decimals for exact precision. Even if normalization is
> > optional, you can't trust that it wasn't normalized at write time. In
> > addition, the spec would need a lot more detail because Parquet would
> > need to document rules for normalization. For instance, when 12 is
> > stored as an int16, should it be normalized at read time to an int8?
> > What about storing 12 as 12.00 (decimal(4,2))?
>
> Could you clarify your concerns here? The specification already appears
> to do at least part of this via the "Type equivalence class" (formally
> known as Logical type) [1] of "exact numeric". If we don't believe
> Parquet should be making this determination, maybe it should be removed
> from the spec? I'm OK with the consensus expressed here -- no
> normalization and no extra metadata. These can always be added in a
> follow-up revision if we find the existing modelling needs to be
> improved.
>
> > But even if we were to allow Parquet to do this, we've already decided
> > not to add similar optimizations that preserve types on the basis that
> > they are not very useful. Earlier in our discussions, I suggested
> > allowing multiple shredded types for a given field name. For instance,
> > shredding to columns with different decimal scales. Other people
> > pointed out that while this would be useful in theory, data tends to
> > be fairly uniformly typed in practice and it wasn't worth the
> > complexity.
>
> This seems reasonable. I'd therefore expect that allowing reference
> implementations to shred data by taking the schema of a field the first
> time it appears would be a reasonable heuristic? More generally, it
> might be good to start discussing what API changes we expect are needed
> to support shredding in reference implementations.
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
>
> On Wed, Dec 11, 2024 at 9:18 AM Russell Spitzer
> <russell.spit...@gmail.com> wrote:
>
> > For normalization I agree with Ryan. I was part of those other
> > discussions and I think it does seem like this is an engine concern
> > and not a storage one.
> >
> > I'm also ok with basically getting no value from min/max of
> > non-shredded fields.
> >
> > On Wed, Dec 11, 2024 at 4:35 AM Antoine Pitrou <anto...@python.org>
> > wrote:
> >
> > > On Mon, 9 Dec 2024 16:33:51 -0800 "rdb...@gmail.com"
> > > <rdb...@gmail.com> wrote:
> > > > I think that Parquet should exactly reproduce the data that is
> > > > written to files, rather than either allowing or requiring Parquet
> > > > implementations to normalize types. To me, that's a fundamental
> > > > guarantee of the storage layer. The compute layer can decide to
> > > > normalize types and take actions to make storage more efficient,
> > > > but storage should not modify the data that is passed to it.
> > >
> > > FWIW, I agree with this.
> > >
> > > Regards
> > >
> > > Antoine.
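For discussion, a hypothetical sketch of the first-occurrence heuristic
raised above: the shredded type of each field is fixed by the first
record in which the field appears, and a later value with a different
inferred type raises an error. The helper name and the use of pyarrow's
scalar type inference are assumptions for illustration only, not part of
the spec or any reference implementation:

    import pyarrow as pa

    def infer_shredding_schema(records):
        # Hypothetical helper: pick each field's shredded type from its
        # first appearance, erroring on any later type mismatch
        # (mirroring the pyarrow struct-conversion behavior above).
        shredded_types = {}
        for record in records:
            for name, value in record.items():
                inferred = pa.scalar(value).type
                if name not in shredded_types:
                    shredded_types[name] = inferred
                elif shredded_types[name] != inferred:
                    raise TypeError(
                        f"field {name!r}: inferred {inferred}, but "
                        f"first occurrence was {shredded_types[name]}")
        return shredded_types

    # infer_shredding_schema([{"a": 1}, {"a": 2}])    -> {'a': int64}
    # infer_shredding_schema([{"a": 1}, {"a": "2"}])  -> TypeError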