Hi Everyone,
I think we are making good progress on the shredding specification.

There are some higher-level concerns that might benefit from more
points of view.  The relevant GitHub discussions are linked below
[2][3].  I've also provided a summary below (apologies for its length).

For background, Variant is a semi-structured type akin to JSON but with
a richer set of primitive types (e.g. timestamp).  The current spec
also defines "logical" types that have multiple physical
representations [1] (e.g. "Exact numeric" is composed of Decimal,
int32, int64, etc., and there are two different physical
representations of strings).

Shredding refers to a storage optimization that extracts some fields
and stores them as individual columns.  For example, with these
variants:
{"a": 1, "b": "abc"}
{"a": 1.2, "b": 1}

The "a" field could be stored in a separate column of Decimal(p=2, s=1),
which would accurately preserve the semantics of the value (and then be
merged back with the "b" column which would be stored as the binary
encoding [1] separately).
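
To make this concrete, here is a rough Python sketch of the
writer-side split (representing the residual as a dict is purely
illustrative; in reality it would be the binary encoding):

    from decimal import Decimal

    rows = [
        {"a": 1, "b": "abc"},
        {"a": Decimal("1.2"), "b": 1},
    ]

    a_typed = []    # shredded "a" column, type Decimal(p=2, s=1)
    residual = []   # remainder of each variant, kept binary encoded
    for row in rows:
        a = row.get("a")
        # 1 and 1.2 are both exact numerics, so both fit Decimal(2, 1)
        a_typed.append(None if a is None
                       else Decimal(a).quantize(Decimal("0.1")))
        residual.append({k: v for k, v in row.items() if k != "a"})

    # a_typed  == [Decimal('1.0'), Decimal('1.2')]
    # residual == [{'b': 'abc'}, {'b': 1}]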

This scheme provides several optimizations for readers:
1.  It allows more efficient projection (e.g. AS_NUMBER(EXTRACT($.a,
variant_col)) would not need to parse the binary encoding).
2.  It potentially allows statistics to be used, so a query like
AS_NUMBER(EXTRACT($.a, variant_col)) > 2 could prune row groups/pages
(see the sketch after this list).
3.  It potentially allows better compression of the values.
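
For example, a reader-side pruning check for optimization 2 might look
roughly like this (the stats argument is illustrative, not an actual
parquet API):

    # Prune a row group for: AS_NUMBER(EXTRACT($.a, variant_col)) > 2
    def can_skip_row_group(a_max) -> bool:
        # a_max is the max statistic of the shredded "a" column.  This
        # is only sound if no "a" value relevant to AS_NUMBER is hiding
        # in the binary-encoded residual (see the next paragraph).
        return a_max is not None and a_max <= 2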

Optimizations 1 and 2 require knowing that nothing stored in
binary-encoded form could affect the results: either everything is
stored in the typed column, or any value stored in binary-encoded form
would not affect the result.  (In the example above, if all values of
field "a" stored in binary-encoded form are strings, say a third value
{"a": "n/a"}, they would not affect the results, since AS_NUMBER would
return null for them.)
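
Here is a sketch of the soundness check a reader would like to make,
written here by parsing the residual, which is exactly what better
metadata/statistics would let us avoid:

    # Can the binary-encoded residual change the result of AS_NUMBER
    # on field "a"?  Only non-string, non-null "a" values can.
    def residual_is_relevant(residual_rows) -> bool:
        for row in residual_rows:
            a = row.get("a")
            if a is not None and not isinstance(a, str):
                return True   # a numeric "a" escaped shredding
        return False          # e.g. {"a": "n/a"}: AS_NUMBER is null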

Optimizations 1, 2, and 3 are most effective when values are
"normalized" (e.g. int32 and int64 values are both stored as int64,
allowing more values to be stored in the typed column instead of in
binary-encoded form).
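
A minimal sketch of the kind of normalization meant here (the mapping
itself is illustrative):

    # Widen semantically equivalent integer types to one shredded type
    # so that more values land in the typed column.
    WIDEN_TO_INT64 = {"int8", "int16", "int32", "int64"}

    def shredded_physical_type(physical_type):
        if physical_type in WIDEN_TO_INT64:
            return "int64"    # one column serves all four input types
        return None           # no match: value stays binary encoded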

Given this background, the open questions are:
1.  Should parquet implementations have the freedom to do normalization
on their own (i.e. consolidate semantically equivalent physical types
into one shredded column and therefore potentially return different
physical types to readers than were provided as input)?  If
implementations can't do this optimization, the implications are one
of:
    a.  Parquet implementations rely on higher-level components to
normalize as much as possible.
    b.  Parquet adds additional metadata to the specification so that
implementations can still normalize but also reproduce the original
physical type (a hypothetical sketch follows this list).
    c.  We lose some level of optimization when semantically equivalent
types are stored, since only one physical type can be shredded into its
own column.
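
For option b, the extra metadata could hypothetically look like a
per-value side column (nothing like this exists in the spec today):

    # Hypothetical: the shredded column stores normalized int64 values,
    # plus a side column recording each value's original physical type
    # so a reader can reproduce exactly what was written.
    shredded_a = [1, 2, 3]                         # stored as int64
    original_type_a = ["int32", "int32", "int64"]  # per-value provenance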

2.  Avoiding parsing binary-encoded variant values is one of the main
goals of shredding.  Therefore, it is useful to have as much metadata
as possible to determine whether binary-encoded values are relevant to
a projection/filter operation.  In the long run this could include
adding variant-specific statistics about which types are encoded, but
in the short run it exposes the fact that the binary encoding of types
appears to be suboptimal for use in min/max statistics.

IIUC, the current encoding stores type information in the least
significant bits of the first byte.  This implies that some types can
make min/max values effectively useless for determining which set of
types is encoded (e.g. short strings can make it impossible to tell
whether all values belong to the same type; see the sketch below).  A
secondary concern is that type ids are tightly packed now, so the
introduction of a new physical type that is semantically equivalent to
an existing type (e.g. float16) could make stats less useful.
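
To illustrate the first concern, here is a sketch of the header byte
per my reading of the encoding [1] (low 2 bits = basic type, upper 6
bits = type-specific info such as a primitive type id or a short
string's length):

    def header_byte(basic_type, info):
        return (info << 2) | basic_type

    PRIMITIVE, SHORT_STRING = 0, 1
    s1 = header_byte(SHORT_STRING, 1)  # short string of length 1: 0x05
    p3 = header_byte(PRIMITIVE, 3)     # primitive with type id 3: 0x0C

    # Byte-wise min/max over a page containing both values spans
    # [0x05, 0x0C], a range that also admits other short-string lengths
    # and other primitive type ids, so the stats cannot pin down which
    # types are actually present.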

Fixing the binary encoding issues would likely require a new version of
the binary specification [4] to accommodate data written before the
spec was donated to the parquet community (as well as the baggage of
backwards compatibility).

Cheers,
Micah

[1] https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
[2] https://github.com/apache/parquet-format/pull/461/files#r1851373388
[3] https://github.com/apache/parquet-format/pull/461#discussion_r1855433947
[4] https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#metadata-encoding
