I see -- thank you for the clarification.

> For instance, if presented with a mix of physical int16 and int32 values for a field, it seems at the storage level it is going to be more efficient to shred all values in an int32 column (this would be the physical type anyways), and add extra metadata to reconstruct the original physical types if needed.

In my opinion, the file format would be simpler to understand, implement, and reason about if it did not permit shredding fields that have different types into a single primitive column. For example, in the records below, "a" has int32 type in the first record but int64 type in the second, and I would suggest not allowing it to be shredded into an int64 column:

{"a": 1i32, "b": "abc"}
{"a": 2i64, "b": "def"}

The processing engine would either have to cast "a" to int64 before writing, shred it to another column with variant type, or not shred it. Not permitting shredding of a field with multiple types seems like it would avoid the complexities you are describing:

1. If a field is shredded to a new column, the writer simply writes values of that type (and errors if it sees a different type in some subsequent record, as sketched below) -- no coercion / casting required.
2. If a field is shredded to a new column, it is guaranteed to have the same type, and thus statistics should work well.

At least with SQL / structured databases, fields with different types in different records can't be well represented in the data model, so the reader would likely impose a common type anyways.
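A rough sketch of the writer-side check in point 1 (illustrative Python with made-up names; not code from the spec or from any Parquet implementation):

    # Illustrative only: a shredded field is declared with a single physical
    # type, and the writer refuses to silently coerce anything else into it.
    class ShreddedFieldWriter:
        def __init__(self, field_name, shredded_type):
            self.field_name = field_name
            self.shredded_type = shredded_type  # fixed when shredding is planned
            self.typed_values = []              # stand-in for the typed Parquet column

        def write(self, value, physical_type):
            if physical_type != self.shredded_type:
                # No coercion at the Parquet layer: the engine should cast up
                # front, shred the field as a variant-typed column, or not shred it.
                raise ValueError(
                    f"field {self.field_name!r}: expected {self.shredded_type}, "
                    f"got {physical_type}")
            self.typed_values.append(value)

    w = ShreddedFieldWriter("a", "int64")
    w.write(2, "int64")   # ok
    w.write(1, "int32")   # raises rather than silently widening int32 -> int64

The writer then never has to reason about "semantically equivalent" physical types; any normalization happens (or doesn't) above the Parquet layer.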
Andrew

On Sun, Dec 8, 2024 at 5:23 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Andrew,
>
> Thanks for the input. I think these design choices mostly revolve around the distributions of physical types we expect to see for specific fields. I've added some more examples below.
>
> > TLDR is that for the shredding question, I suggest the format not require/allow normalization and leave that as a feature for writer implementations if desired[1].
>
> This makes sense to me as a reasonable path forwards. It leaves the question of whether more metadata is needed in the case that clients don't want the parquet implementation to do normalization. For instance, if presented with a mix of physical int16 and int32 values for a field, it seems at the storage level it is going to be more efficient to shred all values in an int32 column (this would be the physical type anyways), and add extra metadata to reconstruct the original physical types if needed.
>
> > In terms of statistics, I think trying to require the binary version to be useful for min/max calculations with existing statistics would be of very limited value (though I may misunderstand).
>
> I agree with this without shredding. In most cases I expect top-level variants to mostly be "Objects", so min/max probably doesn't add too much value. When thinking about shredding, the general expectation is that fields are mostly of the same type and there will be a fair number of shredded primitive types. To better understand the bit-ordering implications we can look at a concrete, possibly contrived, example.
>
> Consider a field shredded into an int32 typed value. The remaining data (encoded in binary form) has short strings of lengths between 0 (leading byte 0b00000001) and 7 (leading byte 0b00011101).
>
> In the current binary encoding scheme, we cannot determine whether the binary encoded values contain an int64 value (IIUC the leading byte is 0b00011000). This implies that the statistics for the shredded int32 column cannot be used for pruning, and that projecting an exact numeric value requires parsing the entire binary encoded column.
>
> If the basic_type data was instead encoded in the most significant bits, we would get leading byte 0b01000000 for an empty short string, 0b01000111 for a length-7 short string, and 0b00000110 for int64. This would allow min/max bounds to be used to definitively determine whether there were only short strings vs. other primitive types.
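For concreteness, here is a rough sketch of the two leading-byte layouts being compared; the formulas are one reading of the encoding, shown for illustration only:

    # Toy model of the leading byte for short strings vs. an int64 primitive.
    PRIMITIVE, SHORT_STRING = 0, 1   # basic_type codes
    INT64 = 6                        # primitive type id for int64

    def header_current(basic_type, value_header):
        # current spec: basic_type in the 2 least significant bits
        return (value_header << 2) | basic_type

    def header_msb(basic_type, value_header):
        # alternative: basic_type in the 2 most significant bits
        return (basic_type << 6) | value_header

    lengths = range(8)  # short strings of length 0..7
    cur = [header_current(SHORT_STRING, n) for n in lengths] + [header_current(PRIMITIVE, INT64)]
    alt = [header_msb(SHORT_STRING, n) for n in lengths] + [header_msb(PRIMITIVE, INT64)]

    print(min(cur), max(cur))  # 1 29: the int64 header (24) lies inside the short-string
                               #       range [1, 29], so min/max cannot rule it out
    print(min(alt), max(alt))  # 6 71: any header below 0b01000000 is a primitive, so the
                               #       min of 6 reveals the int64; a min/max entirely within
                               #       0b01000000..0b01111111 would prove only short strings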
> In my mind, if we were starting from scratch, the latter encoding is more useful [1] and would probably be preferred even if we don't know exactly how often it might be used (I might be missing some advantages of the current scheme though). The main argument against it is the practical concern with compatibility of the existing encoding (which from Parquet's perspective is experimental).
>
> Cheers,
> Micah
>
> [1] It also appears better for a nano-optimization; it would take one less instruction to extract length values for short strings (only a mask, compared to a mask + shift). I wouldn't expect this to actually impact performance though.
>
> On Sun, Dec 8, 2024 at 6:23 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
> > Thank you for this summary Micah -- I found it very helpful to have a summary rather than having to incrementally put it together based on comments in the PR.
> >
> > TLDR is that for the shredding question, I suggest the format not require/allow normalization and leave that as a feature for writer implementations if desired[1].
> >
> > > In the long run this could include adding variant-specific statistics about which types are encoded, but in the short run this exposes the fact that the binary encoding of types appears to be suboptimal for use in min/max statistics.
> >
> > In terms of statistics, I think trying to require the binary version to be useful for min/max calculations with existing statistics would be of very limited value (though I may misunderstand). It seems unlikely to me that there are many use cases where a single min/max value for a binary representation would be useful (though maybe it would make sense for separately stored columns).
> >
> > I think focusing on getting the variants into the spec and widely supported is more important than optimizing the statistics. As you say, adding specialized statistics for structured types is likely better in the long run.
> >
> > Andrew
> >
> > [1]: https://github.com/apache/parquet-format/pull/461#discussion_r1874839885
> >
> > On Fri, Dec 6, 2024 at 2:26 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > Hi Everyone,
> > > I think we are making good progress on the shredding specification.
> > >
> > > There are higher-level concerns that might benefit from more points of view. The relevant github discussions are linked below [2][3]. I've also provided a summary below (apologies for its length).
> > >
> > > For background, Variant is a semi-structured type akin to JSON but with a richer set of primitive types (e.g. timestamp). The current spec also defines "logical" types that have different physical representations [1] (e.g. "Exact numeric" is composed of Decimal, int32, int64, etc.; and there are two different physical representations of strings).
> > >
> > > Shredding refers to a storage optimization of extracting some fields and storing them as individual columns, e.g.
> > >
> > > With variants:
> > > {"a": 1, "b": "abc"}
> > > {"a": 1.2, "b": 1}
> > >
> > > The "a" field could be stored in a separate column of Decimal(p=2, s=1), which would accurately preserve the semantics of the value (and then be merged back with the "b" column, which would be stored separately as the binary encoding [1]).
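As a toy illustration of that split (the names and Python structures below are invented for the example; in the file, the remainder would be kept in the binary-encoded Variant form rather than as Python dicts):

    # Illustrative only: separate field "a" into a typed column and keep the
    # rest of each record together as a binary-encoded remainder.
    from decimal import Decimal

    records = [{"a": 1, "b": "abc"},
               {"a": 1.2, "b": 1}]

    a_typed = [Decimal(str(r["a"])).quantize(Decimal("0.1")) for r in records]
    remainder = [{k: v for k, v in r.items() if k != "a"} for r in records]

    print(a_typed)    # [Decimal('1.0'), Decimal('1.2')]  -- the Decimal(p=2, s=1) column
    print(remainder)  # [{'b': 'abc'}, {'b': 1}]          -- stays binary encoded in practice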
> > > This scheme provides several optimizations for readers:
> > > 1. It allows for more efficient projection (e.g. AS_NUMBER(EXTRACT($.a, variant_col)) would not need to parse the binary encoding).
> > > 2. It potentially allows statistics to be used, so that queries like AS_NUMBER(EXTRACT($.a, variant_col)) > 2 could prune row groups/pages.
> > > 3. It allows for potentially better compression of the values.
> > >
> > > Optimizations 1 and 2 require knowing that there is nothing stored in binary encoded form that could affect the results. Either everything is stored in the typed column, or any value stored in binary encoded form would not affect the result (e.g. in the case above for field "a", if all values stored in binary encoded form are strings -- say there is a third record `{"a": "n/a"}` -- they would not affect the results, since AS_NUMBER would return null for them).
> > >
> > > Optimizations 1, 2, and 3 are most effective when values are "normalized" (e.g. int32 and int64 are both stored as int64, allowing more values to be stored as a normal column instead of in binary encoded form).
> > >
> > > Given this background, the open questions are:
> > > 1. Should parquet implementations have the freedom to do normalization on their own (i.e. consolidate different semantically equivalent physical types into one shredded column and therefore potentially return different physical types to readers than were provided as input)? If implementations can't do this optimization, the implications are one of:
> > >    a. Parquet implementations rely on higher-level components to try to normalize as much as possible.
> > >    b. Parquet adds additional metadata into the specification so that normalization can still be done at the parquet level while reproducing the original physical type.
> > >    c. We lose some level of optimization when semantically equivalent types are stored, since only one physical type can be shredded into its own column.
> > >
> > > 2. Avoiding parsing binary encoded variant values is one of the main goals of shredding. Therefore, it is useful to have as much metadata as possible to determine whether binary encoded values are relevant to a projection/filter operation. In the long run this could include adding variant-specific statistics about which types are encoded, but in the short run this exposes the fact that the binary encoding of types appears to be suboptimal for use in min/max statistics.
> > >
> > > IIUC, the current encoding appears to encode type information in the least significant bits of the first byte. This implies that some types can make min/max values effectively useless for determining which set of types are encoded (e.g. short strings can make it impossible to tell if all values belong to the same type). A secondary concern is that type ids are tightly packed now. This means the introduction of a new physical type that is semantically equivalent to an existing type (e.g. float16) could make stats less useful.
> > >
> > > Fixing the binary encoding issues would likely require a different version in the binary specification [4] to accommodate data written before the spec was donated to the parquet community (as well as the baggage of backwards compatibility).
> > >
> > > Cheers,
> > > Micah
> > >
> > > [1] https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> > > [2] https://github.com/apache/parquet-format/pull/461/files#r1851373388
> > > [3] https://github.com/apache/parquet-format/pull/461#discussion_r1855433947
> > > [4] https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#metadata-encoding