Thank you for this summary, Micah -- I found it very helpful to have a
summary rather than having to piece it together incrementally from
comments on the PR.

TLDR is that for the shredding question, I suggest the format not
require/allow normalization and leave that as a feature for writer
implementations if desired [1].

> In the long run this could include adding variant specific
> statistics about which types are encoded, but in the short run, this
> exposes the fact that the binary encoding of types appears to be
> suboptimal for use in min/max statistics.

In terms of statistics, I think trying to require the binary version to
be useful for min/max calculations with existing statistics would be of
very limited value (though I may misunderstand). It seems unlikely to me
that there are many use cases where a single min/max value for a binary
representation would be useful (though maybe it would make sense for
separately stored columns).

I think focusing on getting variants into the spec and widely supported
is more important than optimizing the statistics. As you say, adding
specialized statistics for structured types is likely better in the long
run.

Andrew

[1]: https://github.com/apache/parquet-format/pull/461#discussion_r1874839885

On Fri, Dec 6, 2024 at 2:26 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Everyone,
> I think we are making good progress on the shredding specification.
>
> There are higher level concerns that might benefit from more points of
> view. The relevant github discussions are linked below [2][3]. I've
> also provided a summary below (apologies for its length).
>
> For background, Variant is a semi-structured type akin to JSON but with
> a richer set of primitive types (e.g. timestamp). The current spec also
> defines "logical" types that have different physical representations
> [1] (e.g. "Exact numeric" is composed of Decimal, int32, int64, etc.;
> and there are two different physical representations of strings).
>
> Shredding refers to a storage optimization of extracting some fields
> and storing them as individual columns. e.g.
>
> With variants:
> {"a": 1, "b": "abc"}
> {"a": 1.2, "b": 1}
>
> The "a" field could be stored in a separate column of Decimal(p=2,
> s=1), which would accurately preserve the semantics of the value (and
> then be merged back with the "b" column, which would be stored
> separately in the binary encoding [1]).
>
> This scheme provides several optimizations for readers:
> 1. It allows for more efficient projection (e.g. AS_NUMBER(EXTRACT($.a,
> variant_col)) would not need to parse the binary encoding).
> 2. It potentially allows for using statistics: e.g. a query like
> AS_NUMBER(EXTRACT($.a, variant_col)) > 2 could prune row groups/pages.
> 3. It allows for potentially better compression of the values.
>
> Optimizations 1 and 2 require knowing that there is nothing stored in
> binary encoded form that could affect the results: either everything is
> stored in the typed column, or any value stored in binary encoded form
> would not affect the result (e.g. in the case above for field "a", if
> all values stored in binary encoded form are strings, a third value
> {"a": "n/a"} would not affect the results, since AS_NUMBER would return
> null for it).
>
> Optimizations 1, 2, and 3 are most effective when values are
> "normalized" (e.g. int32 and int64 are both stored as int64, allowing
> for more values to be stored as a normal column instead of in binary
> encoded form).
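To make the shredded layout concrete, here is a rough sketch of one
possible Parquet schema for the example above, spelled out with pyarrow
purely for illustration. The typed_value/value field names and the
exact nesting are assumptions based on one reading of the draft
proposal, not the final spec:

    # Rough, non-normative sketch of a shredded schema for the example.
    # "typed_value"/"value" names are assumed from the draft proposal.
    import pyarrow as pa

    shredded = pa.schema([
        pa.field("variant_col", pa.struct([
            # Binary encoded Variant metadata (field name dictionary).
            pa.field("metadata", pa.binary()),
            pa.field("a", pa.struct([
                # Values that fit the shredded type land here...
                pa.field("typed_value", pa.decimal128(2, 1)),
                # ...anything else (e.g. the string "n/a") falls back
                # to the binary Variant encoding.
                pa.field("value", pa.binary()),
            ])),
            # "b" is not shredded; it stays in the binary encoding.
            pa.field("b", pa.struct([
                pa.field("value", pa.binary()),
            ])),
        ])),
    ])
    print(shredded)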
> Given this background, the open questions are:
>
> 1. Should parquet implementations have freedom to do normalization on
> their own (i.e. consolidate different semantically equivalent physical
> types into one shredded column, and therefore potentially return
> different physical types to readers than were provided as input)? If
> implementations can't do this optimization, the implications are
> either:
> a. Parquet implementations rely on higher level components to try to
> normalize as much as possible.
> b. Parquet adds additional metadata into the specification to still do
> normalization at the parquet level but reproduce the original physical
> type.
> c. We lose some level of optimization if semantically equivalent types
> are stored, since only one physical type can be shredded into its own
> column.
>
> 2. Avoiding parsing binary encoded variant values is one of the main
> goals of shredding. Therefore, it is useful to have as much metadata as
> possible to determine whether binary encoded values are relevant to a
> projection/filter operation. In the long run this could include adding
> variant specific statistics about which types are encoded, but in the
> short run, this exposes the fact that the binary encoding of types
> appears to be suboptimal for use in min/max statistics.
>
> IIUC, the current encoding appears to encode type information in the
> least significant bits of the first byte. This implies that some types
> can make min/max values effectively useless for determining which set
> of types are encoded (e.g. short strings can make it impossible to tell
> if all values belong to the same type). A secondary concern is that
> type ids are tightly packed now. This means the introduction of a new
> physical type that is semantically equivalent to an existing type (e.g.
> float16) could make stats less useful.
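To make the min/max concern concrete, here is a toy sketch of the first
(header) byte as I read the encoding doc [1]: the basic type sits in
the two least significant bits and a type-specific header in the upper
six bits. The specific type ids below are illustrative assumptions,
not the normative values:

    # Toy sketch of the Variant value header byte (non-normative).
    PRIMITIVE, SHORT_STRING = 0, 1  # basic types (2 LSBs), assumed ids

    def header_byte(basic_type: int, value_header: int) -> int:
        # Basic type in the low 2 bits, type-specific header above it.
        return (value_header << 2) | basic_type

    # A short string's header byte encodes its *length* in the upper
    # six bits, so a two-character string and a primitive with an
    # (assumed) type id of 3 produce adjacent-looking first bytes:
    print(hex(header_byte(SHORT_STRING, 2)))  # 0x9: short string, len 2
    print(hex(header_byte(PRIMITIVE, 3)))     # 0xc: primitive, id 3

    # Byte-wise min/max over such values straddles unrelated types, so
    # the stats say little about which logical types are present.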
> Fixing the binary encoding issues would likely require a different
> version of the binary specification [4] to accommodate data written
> before the spec was donated to the parquet community (as well as the
> baggage of backwards compatibility).
>
> Cheers,
> Micah
>
> [1]
> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> [2] https://github.com/apache/parquet-format/pull/461/files#r1851373388
> [3]
> https://github.com/apache/parquet-format/pull/461#discussion_r1855433947
> [4]
> https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#metadata-encoding