Hi Ryan,

> In addition to being an important and basic guarantee of the format, I
> think there are a few other good reasons for this. Normalizing in the
> engine keeps the spec small while remaining flexible and expressive. For
> example, the value 12.00 (decimal(4,2)) is equivalent to the 12 (int8) for
> some use cases, but not in others. If Parquet requires that 12.00 is always
> equivalent to 12, then values can't be trusted for the cases that use
> decimals for exact precision. Even if normalization is optional, you can't
> trust that it wasn't normalized at write time. In addition, the spec would
> need a lot more detail because Parquet would need to document rules for
> normalization. For instance, when 12 is stored as an int16, should it be
> normalized at read time to an int8? What about storing 12 as 12.00
> (decimal(4,2))?
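To make sure I'm reading the example the same way, here is how I understand
it, as a minimal sketch using Python's decimal module rather than any Parquet
or Variant API (the variable names are mine):

    # Minimal sketch, standard library only (no Parquet/Variant API):
    # 12 and 12.00 compare equal as "exact numeric" values, but normalizing
    # one into the other at write time silently drops the declared scale.
    from decimal import Decimal

    as_written = Decimal("12.00")   # what an engine handed to storage: decimal(4,2)
    normalized = Decimal("12")      # what a normalizing writer might store instead

    assert as_written == normalized               # equal for numeric comparison
    assert str(as_written) != str(normalized)     # but not for exact round-trips
    assert as_written.as_tuple().exponent == -2   # scale 2 survives only if unmodified
    assert normalized.as_tuple().exponent == 0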
Could you clarify your concerns here? The specification already appears to do
exactly this, at least partially, via the "Type equivalence class" (formally
known as Logical type) [1] of "exact numeric". If we don't think Parquet
should be making this determination, maybe it should be removed from the spec?

I'm OK with the consensus expressed here: no normalization and no extra
metadata. These can always be added in a follow-up revision if we find the
existing modelling needs to be improved.

> But even if we were to allow Parquet to do this, we've already decided not
> to add similar optimizations that preserve types on the basis that they are
> not very useful. Earlier in our discussions, I suggested allowing multiple
> shredded types for a given field name. For instance, shredding to columns
> with different decimal scales. Other people pointed out that while this
> would be useful in theory, data tends to be fairly uniformly typed in
> practice and it wasn't worth the complexity.

This seems reasonable. I'd therefore expect that allowing reference
implementations to shred data by taking the schema of a field the first time
it appears is a reasonable heuristic? (A rough sketch follows at the end of
this message.)

More generally, it might be good to start discussing what API changes we
expect are needed to support shredding in reference implementations.

Thanks,
Micah

[1]
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types

On Wed, Dec 11, 2024 at 9:18 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> For normalization I agree with Ryan. I was part of those other discussions
> and I think it does seem like this is an engine concern and not a storage
> one.
>
> I'm also ok with basically getting no value from min/max of non-shredded
> fields.
>
> On Wed, Dec 11, 2024 at 4:35 AM Antoine Pitrou <anto...@python.org> wrote:
>
> > On Mon, 9 Dec 2024 16:33:51 -0800
> > "rdb...@gmail.com" <rdb...@gmail.com> wrote:
> > > I think that Parquet should exactly reproduce the data that is written
> > > to files, rather than either allowing or requiring Parquet
> > > implementations to normalize types. To me, that's a fundamental
> > > guarantee of the storage layer. The compute layer can decide to
> > > normalize types and take actions to make storage more efficient, but
> > > storage should not modify the data that is passed to it.
> >
> > FWIW, I agree with this.
> >
> > Regards
> >
> > Antoine.
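P.S. The shredding heuristic mentioned above, as a rough sketch in plain
Python (not a proposal for any actual reference-implementation API; the
infer_shredding_schema name and the (type, value) record shape are made up
for illustration):

    # Rough sketch of the "first time a field appears" heuristic.
    def infer_shredding_schema(records):
        """records: iterable of dicts mapping field name -> (variant_type, value)."""
        schema = {}
        for record in records:
            for field, (variant_type, _value) in record.items():
                # First appearance wins; later conflicting values would be left
                # in the unshredded (binary) variant column rather than re-typed.
                schema.setdefault(field, variant_type)
        return schema

    sample = [
        {"id": ("int64", 1), "price": ("decimal(4,2)", "12.00")},
        {"id": ("int64", 2), "price": ("double", 12.5)},   # falls back to binary variant
    ]
    print(infer_shredding_schema(sample))   # {'id': 'int64', 'price': 'decimal(4,2)'}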