> The processing engine would either have to cast "a" to int64 before writing, shred it to another column with variant type, or not shred it.

I think this might be key to the discussion. In my mind shredding is a storage-level concern, and perhaps a shared concern between engines and storage when lower-level optimizations are required. IMO Parquet implementations should be able to offer a function signature like `writeVariant(const std::string& metadata, const std::string& value)` and store the data in a reasonably efficient manner, assuming some level of common schema between different variant objects. As an analogy, most Parquet writers will attempt to dictionary encode data as an optimization. Advanced writers will specifically pass through a dictionary for the values to be encoded if they are already dictionary encoded in memory.
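For concreteness, a minimal sketch of the kind of writer-facing entry point meant here, assuming the caller hands over already-encoded Variant metadata/value buffers and the implementation decides internally whether and how to shred. The class names and the buffering implementation are purely illustrative, not an existing Parquet API:

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Illustrative sketch: shredding is treated purely as a storage-level
// decision inside the writer.  The caller only supplies the Variant's
// encoded metadata and value buffers.
class VariantColumnWriter {
 public:
  virtual ~VariantColumnWriter() = default;

  // Append one Variant value.  An implementation may inspect the encoded
  // value, shred common fields into typed sub-columns, and fall back to
  // binary-encoded storage for anything else.
  virtual void writeVariant(const std::string& metadata,
                            const std::string& value) = 0;
};

// Trivial implementation that performs no shredding and simply buffers the
// encoded pairs, to show the interface does not dictate a storage strategy.
class BufferingVariantWriter : public VariantColumnWriter {
 public:
  void writeVariant(const std::string& metadata,
                    const std::string& value) override {
    rows_.push_back({metadata, value});
  }
  std::size_t numRows() const { return rows_.size(); }

 private:
  struct Row {
    std::string metadata;
    std::string value;
  };
  std::vector<Row> rows_;
};

int main() {
  BufferingVariantWriter writer;
  // Opaque placeholder buffers; a real caller would pass Variant-encoded bytes.
  writer.writeVariant("metadata-bytes", "value-bytes");
  std::cout << "buffered rows: " << writer.numRows() << "\n";
  return 0;
}
```

The point is only that the signature keeps shredding out of the caller's hands; a production writer would replace the buffering with its actual column-assembly logic.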
> 1. If a field is shredded to a new column, the writer simply writes values of that type (and errors if it sees a different type on some subsequent record) - no coercion / casting required
> 2. If a field is shredded to a new column, it is guaranteed to have the same type and thus statistics should work well

I think in practice this is challenging, since one doesn't necessarily know the physical types beforehand (thus leading to costly rewrites if not everything could be shredded), and it misses out on room for substantial optimizations at the storage level (suppose 19,999 values are int16 and 1 is of a different type). The current proposal is closer to this: the shredded type is specific to one type and any value that doesn't fall into that category is binary encoded. As discussed in the thread already, this approach leaves the validity of statistics in question depending on what types are binary encoded.

> At least with SQL / structured databases, having fields with different types in different records can't be well represented in the data model, so the reader would likely impose a common type anyways.

I think this describes the common case: readers typically only care about a field that contains a number. This is why it is not necessarily unreasonable to let Parquet implementations do normalization automatically. I agree mixing types in SQL is a little awkward, but as an example, JSON is supported in the SQL standard, and it ends up being an opaque object that can be operated on with accessor functions that either error on unexpected types or return nulls (for example, see Postgres's functions [1]).

[1] https://www.postgresql.org/docs/current/functions-json.html

Thanks,
Micah
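To make the normalization idea more concrete -- widening semantically equivalent integer types into a single shredded column while keeping enough per-value metadata to reproduce the original physical types, as discussed further down the thread -- here is a rough sketch. The type names and layout are illustrative only, not part of the shredding specification:

```cpp
#include <cstdint>
#include <iostream>
#include <variant>
#include <vector>

// Original physical type of each value, kept so a reader could reconstruct
// what the writer was given (illustrative; not part of the shredding spec).
enum class PhysicalType : uint8_t { kInt16, kInt32, kInt64 };

struct NormalizedColumn {
  std::vector<int64_t> values;         // single shredded, widened column
  std::vector<PhysicalType> original;  // per-value original physical type
};

using IntValue = std::variant<int16_t, int32_t, int64_t>;

// Widen a mix of int16/int32/int64 values into one int64 column while
// remembering each value's original physical type.
NormalizedColumn normalize(const std::vector<IntValue>& input) {
  NormalizedColumn out;
  for (const IntValue& v : input) {
    if (std::holds_alternative<int16_t>(v)) {
      out.values.push_back(std::get<int16_t>(v));
      out.original.push_back(PhysicalType::kInt16);
    } else if (std::holds_alternative<int32_t>(v)) {
      out.values.push_back(std::get<int32_t>(v));
      out.original.push_back(PhysicalType::kInt32);
    } else {
      out.values.push_back(std::get<int64_t>(v));
      out.original.push_back(PhysicalType::kInt64);
    }
  }
  return out;
}

int main() {
  std::vector<IntValue> mixed = {int16_t{1}, int32_t{200000}, int64_t{3}};
  NormalizedColumn col = normalize(mixed);
  std::cout << "shredded values: " << col.values.size() << "\n";
  return 0;
}
```

Whether such a per-value type tag belongs in the Parquet specification itself is essentially option 1(b) in the open questions quoted further down the thread.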
On Sun, Dec 8, 2024 at 3:00 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

> I see -- thank you for the clarification
>
> > For instance, if presented with a mix of physical int16 and int32 values for a field, it seems at the storage level it is going to be more efficient to shred all values in an int32 column (this would be the physical type anyways), and add extra metadata to reconstruct the original physical types if needed.
>
> In my opinion, the file format would be simpler to understand, implement, and reason about if it did not permit shredding fields that had different types into a primitive column.
>
> For example, in the records below, "a" has int32 type in the first record but int64 type in the second, and I would suggest not allowing it to be shredded into an int64 column.
>
> {"a": 1i32, "b": "abc"}
> {"a": 2i64, "b": "def"}
>
> The processing engine would either have to cast "a" to int64 before writing, shred it to another column with variant type, or not shred it.
>
> Not permitting shredding a field with multiple types seems like it would avoid the complexities you are describing:
> 1. If a field is shredded to a new column, the writer simply writes values of that type (and errors if it sees a different type on some subsequent record) - no coercion / casting required
> 2. If a field is shredded to a new column, it is guaranteed to have the same type and thus statistics should work well
>
> At least with SQL / structured databases, having fields with different types in different records can't be well represented in the data model, so the reader would likely impose a common type anyways.
>
> Andrew
>
> On Sun, Dec 8, 2024 at 5:23 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > Hi Andrew,
> >
> > Thanks for the input. I think these design choices mostly revolve around distributions of physical types we expect to see for specific fields. I've added some more examples below.
> >
> > > TLDR is that for the shredding question, I suggest the format not require/allow normalization and leave that as a feature for writer implementations if desired [1].
> >
> > This makes sense to me as a reasonable path forwards. It leaves the question of whether more metadata is needed in the case that clients don't want the Parquet implementation to do normalization. For instance, if presented with a mix of physical int16 and int32 values for a field, it seems at the storage level it is going to be more efficient to shred all values in an int32 column (this would be the physical type anyways), and add extra metadata to reconstruct the original physical types if needed.
> >
> > > In terms of statistics, I think trying to require the binary version to be useful for min/max calculations with existing statistics would be of very limited value (though I may misunderstand).
> >
> > I agree with this without shredding. In most cases I expect top level variants to mostly be "Objects", so min/max probably doesn't add too much value. When thinking about shredding, the general expectation is that fields are mostly of the same type and there will be a fair number of shredded primitive types. To better understand the bit ordering implications we can look at a concrete, possibly contrived, example.
> >
> > Consider a field shredded into an int32 typed value. The remaining data (encoded in binary form) has short strings of lengths between 0 (leading byte 0b00000001) and 7 (leading byte 0b00011101).
> >
> > In the current binary encoding scheme, we cannot determine whether the binary encoded values contain an int64 value (IIUC the leading byte is 0b00011000). This implies that the statistics for the shredded int32 column cannot be used for pruning, and for projecting an exact numeric value the entire binary encoded column needs to be parsed. If the basic_type data were encoded in the most significant bits we would get a leading byte of 0b01000000 for an empty short string, 0b01000111 for length-7 short strings, and 0b00000110 for int64. This would allow min/max bounds to be used to definitively determine if there were only short strings vs other primitive types.
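The arithmetic behind the leading bytes quoted above can be checked with a small sketch. It assumes the header layouts exactly as the example describes them (a basic_type of 0 for primitives with type id 6 for int64, a basic_type of 1 for short strings with the length in the other six bits, as implied by the quoted byte values); the helper names are made up for illustration:

```cpp
#include <cstdio>

// Values taken from the example above: basic_type 0 = primitive (type id 6
// for int64), basic_type 1 = short string (length in the other six bits).
constexpr unsigned kPrimitive = 0;
constexpr unsigned kShortString = 1;
constexpr unsigned kInt64TypeId = 6;

// Current scheme: basic_type lives in the two least significant bits.
constexpr unsigned headerLsb(unsigned basicType, unsigned typeSpecific) {
  return (typeSpecific << 2) | basicType;
}

// Alternative scheme from the example: basic_type in the most significant bits.
constexpr unsigned headerMsb(unsigned basicType, unsigned typeSpecific) {
  return (basicType << 6) | typeSpecific;
}

int main() {
  // Current layout: short strings of length 0..7 give headers 0x01..0x1D and
  // the int64 header 0x18 lands inside that range, so min/max on the leading
  // byte cannot prove "only short strings are present".
  unsigned lsbStr0 = headerLsb(kShortString, 0);            // 0b00000001
  unsigned lsbStr7 = headerLsb(kShortString, 7);            // 0b00011101
  unsigned lsbInt64 = headerLsb(kPrimitive, kInt64TypeId);  // 0b00011000
  std::printf("LSB: str0=0x%02X str7=0x%02X int64=0x%02X int64_in_range=%d\n",
              lsbStr0, lsbStr7, lsbInt64,
              lsbInt64 >= lsbStr0 && lsbInt64 <= lsbStr7);

  // Alternative layout: every short-string header is >= 0x40 while every
  // primitive header is < 0x40, so the two ranges separate cleanly.
  unsigned msbStr0 = headerMsb(kShortString, 0);            // 0b01000000
  unsigned msbStr7 = headerMsb(kShortString, 7);            // 0b01000111
  unsigned msbInt64 = headerMsb(kPrimitive, kInt64TypeId);  // 0b00000110
  std::printf("MSB: str0=0x%02X str7=0x%02X int64=0x%02X int64_in_range=%d\n",
              msbStr0, msbStr7, msbInt64,
              msbInt64 >= msbStr0 && msbInt64 <= msbStr7);
  return 0;
}
```

Under the current layout the int64 header (0x18) falls inside the short-string range (0x01-0x1D), so a leading-byte min/max cannot exclude it; under the alternative layout the short-string headers (0x40-0x47 here) sort strictly above every primitive header.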
> > In my mind if we were starting from scratch the latter encoding is more useful [1] and would probably be preferred even if we don't know exactly how often it might be used (I might be missing some advantages of the current scheme though). The main argument against it is the practical concern with compatibility of the existing encoding (which from Parquet's perspective is experimental).
> >
> > Cheers,
> > Micah
> >
> > [1] It also appears better for a nano-optimization; it would take one less instruction to extract length values for short strings (only a mask, compared to a mask + shift). I wouldn't expect this to actually impact performance though.
> >
> > On Sun, Dec 8, 2024 at 6:23 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> > > Thank you for this summary Micah -- I found it very helpful to have a summary rather than have to incrementally put it together based on comments in the PR.
> > >
> > > TLDR is that for the shredding question, I suggest the format not require/allow normalization and leave that as a feature for writer implementations if desired [1].
> > >
> > > > In the long run this could include adding variant specific statistics about which types are encoded, but in the short run, this exposes the fact that the binary encoding of types appears to be suboptimal for use in min/max statistics.
> > >
> > > In terms of statistics, I think trying to require the binary version to be useful for min/max calculations with existing statistics would be of very limited value (though I may misunderstand). It seems unlikely to me that there are many use cases where a single min/max value for a binary representation would be useful (though maybe it would make sense for separately stored columns).
> > >
> > > I think focusing on getting variants into the spec and widely supported is more important than optimizing the statistics. As you say, adding specialized statistics for structured types is likely better in the long run.
> > >
> > > Andrew
> > >
> > > [1]: https://github.com/apache/parquet-format/pull/461#discussion_r1874839885
> > >
> > > On Fri, Dec 6, 2024 at 2:26 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > > > Hi Everyone,
> > > >
> > > > I think we are making good progress on the shredding specification.
> > > >
> > > > There are higher level concerns that might benefit from more points of view. The relevant GitHub discussions are linked below [2][3]. I've also provided a summary below (apologies for its length).
> > > >
> > > > For background, Variant is a semi-structured type akin to JSON but with a richer set of primitive types (e.g. timestamp). The current spec also defines "logical" types that have different physical representations [1] (e.g. "Exact numeric" is composed of Decimal, int32, int64, etc; and there are two different physical representations of strings).
> > > >
> > > > Shredding refers to a storage optimization of extracting some fields and storing them as individual columns, e.g. with variants:
> > > >
> > > > {"a": 1, "b": "abc"}
> > > > {"a": 1.2, "b": 1}
> > > >
> > > > The "a" field could be stored in a separate column of Decimal(p=2, s=1), which would accurately preserve the semantics of the value (and then be merged back with the "b" column, which would be stored as the binary encoding [1] separately).
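As a rough illustration of the layout this example describes -- one typed column for the shredded "a" field plus an opaque binary-encoded remainder for everything else -- here is a sketch; the structs, the tiny decimal stand-in for Decimal(p=2, s=1), and the placeholder byte strings are illustrative rather than the spec's physical layout:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Minimal stand-in for a Decimal(p=2, s=1) value: unscaled integer + scale.
struct Decimal {
  int64_t unscaled;  // e.g. 12 with scale 1 represents 1.2
  int32_t scale;
};

// Logical shape of a shredded variant column for the quoted example:
// field "a" is pulled out into a typed column, everything else ("b" here)
// stays behind as opaque binary-encoded Variant bytes.
struct ShreddedVariantBatch {
  std::vector<Decimal> a;              // shredded typed column for field "a"
  std::vector<std::string> remainder;  // binary-encoded Variant per row
};

int main() {
  // {"a": 1, "b": "abc"} and {"a": 1.2, "b": 1}
  ShreddedVariantBatch batch;
  batch.a.push_back({10, 1});  // 1.0 as Decimal(p=2, s=1)
  batch.a.push_back({12, 1});  // 1.2 as Decimal(p=2, s=1)
  // Placeholder bytes standing in for the real binary encoding of {"b": ...}.
  batch.remainder.push_back("<encoded {\"b\": \"abc\"}>");
  batch.remainder.push_back("<encoded {\"b\": 1}>");

  std::cout << "rows: " << batch.a.size() << "\n";
  return 0;
}
```

A reader that only needs "a" touches just the typed column; reconstructing full variant values means merging each row's "a" back with the decoded remainder.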
> > > > This scheme provides several optimizations for readers:
> > > >
> > > > 1. It allows for more efficient projection (e.g. AS_NUMBER(EXTRACT($.a, variant_col)) would not need to parse the binary encoding).
> > > > 2. It potentially allows for using statistics, e.g. a query like AS_NUMBER(EXTRACT($.a, variant_col)) > 2 could prune row groups/pages.
> > > > 3. It allows for potentially better compression of the values.
> > > >
> > > > Optimizations 1 and 2 require knowing that there is nothing stored in binary encoded form that could affect the results. Either everything is stored in the typed column, or any value stored in binary encoded form would not affect the result (e.g. in the case above for field "a", if all values stored in binary encoded form are strings -- that is, if there is a third value `{"a": "n/a"}` -- it would not affect the results since AS_NUMBER would return null for it).
> > > >
> > > > Optimizations 1, 2 and 3 are most effective when values are "normalized" (e.g. int32 and int64 are both stored as int64, allowing for more values to be stored as a normal column instead of in binary encoded form).
> > > >
> > > > Given this background, the open questions are:
> > > >
> > > > 1. Should Parquet implementations have freedom to do normalization on their own (i.e. consolidate different semantically equivalent physical types into one shredded column and therefore potentially return different physical types to readers than were provided as input)? If implementations can't do this optimization, the implications are either:
> > > >    a. Parquet implementations rely on higher level components to try to normalize as much as possible.
> > > >    b. Parquet adds into the specification additional metadata at the Parquet level to still do normalization but reproduce the original physical type.
> > > >    c. We lose some level of optimization if semantically equivalent types are stored, since only one physical type can be shredded into its own column.
> > > >
> > > > 2. Avoiding parsing binary encoded variant values is one of the main goals of shredding. Therefore, it is useful to have as much metadata as possible to determine whether binary encoded values are relevant to a projection/filter operation. In the long run this could include adding variant specific statistics about which types are encoded, but in the short run this exposes the fact that the binary encoding of types appears to be suboptimal for use in min/max statistics.
> > > >
> > > > IIUC, the current encoding appears to encode type information in the least significant bits of the first byte. This implies that some types can make min/max values effectively useless for determining which set of types are encoded (e.g. short strings can make it impossible to tell if all values belong to the same type). A secondary concern is that type ids are tightly packed now. This means introduction of a new physical type that is semantically equivalent to an existing type (e.g. float16) could make stats less useful.
> > > >
> > > > Fixing the binary encoding issues would likely require a different version in the binary specification [4] to accommodate data written before the spec was donated to the Parquet community (as well as the baggage of backwards compatibility).
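Tying optimizations 1 and 2 above to a concrete check: a reader evaluating a predicate like AS_NUMBER(EXTRACT($.a, variant_col)) > 2 can only prune a row group from the shredded column's statistics when it also knows the binary-encoded remainder cannot contribute a numeric value for "a". A sketch of that decision, with illustrative names; the flag describing what the remainder may contain stands in for exactly the kind of extra metadata/statistics discussed above and is not something the current spec provides:

```cpp
#include <cstdint>
#include <iostream>

// Statistics for the shredded, typed column of field "a" in one row group.
struct ShreddedStats {
  int64_t min;
  int64_t max;
};

// Illustrative extra knowledge about the binary-encoded remainder: whether it
// might hold values that AS_NUMBER(...) would turn into numbers.  Without
// such metadata a reader has to assume "true" and parse the remainder.
struct RemainderInfo {
  bool may_contain_numeric;
};

// Can the row group be skipped for the predicate AS_NUMBER(a) > threshold?
bool canPruneGreaterThan(const ShreddedStats& stats,
                         const RemainderInfo& remainder,
                         int64_t threshold) {
  if (remainder.may_contain_numeric) {
    // A numeric value hiding in the binary-encoded remainder could satisfy
    // the predicate, so the shredded column's max alone proves nothing.
    return false;
  }
  // Only the typed column can produce numbers; prune if even its max fails.
  return stats.max <= threshold;
}

int main() {
  ShreddedStats stats{/*min=*/0, /*max=*/2};
  std::cout << canPruneGreaterThan(stats, {/*may_contain_numeric=*/false}, 2)
            << "\n";  // 1: safe to prune, remainder is known non-numeric
  std::cout << canPruneGreaterThan(stats, {/*may_contain_numeric=*/true}, 2)
            << "\n";  // 0: remainder might hold a qualifying number
  return 0;
}
```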
> > > > Cheers,
> > > > Micah
> > > >
> > > > [1] https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> > > > [2] https://github.com/apache/parquet-format/pull/461/files#r1851373388
> > > > [3] https://github.com/apache/parquet-format/pull/461#discussion_r1855433947
> > > > [4] https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#metadata-encoding