cashmand commented on PR #46831: URL: https://github.com/apache/spark/pull/46831#issuecomment-2155445121
Hi @shaeqahmed, thanks for your detailed response. Your suggestions add a lot of flexibility to the shredding scheme! At the same time, we are wary of adding complexity that could be a burden on implementations to support. Our expectation is that the primary candidate for shredding is data with a fairly uniform sub-structure; in particular, this assumption simplifies both the shredding decision and the resulting behavior. With a more flexible shredding scheme, it also becomes more difficult to decide what to shred, since there are significantly more viable options with nuanced tradeoffs. Simplicity in implementation and in user-observed behavior is very important to us.

Since the benefits of shredding are both data- and workload-dependent, could you help us understand concrete query examples for your suggested features? Do you have particular use cases where you expect to write a non-uniform shredding scheme and get a significant performance benefit?

A few specific points:

> In my proposal, the metadata fields are also made optional, which if not present, means that the metadata is encoded in the value.

Do you have specific typed/untyped combinations in mind that are common based on your experience? Adding options to the spec increases implementation complexity (readers need to support both versions to function correctly), and we'd like to explore the impact of these choices more concretely. Our motivation for combining the metadata and value is to reduce the size of the Parquet schema. Large schemas can be quite a performance burden due to how Parquet stores its footer, especially for selective queries.

> metadata_key_paths

Do you have a concrete query in mind for this feature? My understanding is that it is redundant, and readers could safely ignore it. We've purposely designed shredding without redundancy to avoid unexpected increases in storage.
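To make the single typed/untyped split concrete, here is a minimal sketch of how a writer might route one shredded path into a `typed_value` or `untyped_value` column pair. The column names follow the shredding discussion in this thread; the path names, the `SHRED_TYPES` shredding decision, and the use of plain Python values instead of a binary variant encoding are all illustrative assumptions, not part of the spec.

```python
# Hypothetical shredding decision: which type each shredded path is expected
# to hold. In the real scheme this would come from the writer's analysis of
# the data; the paths and types here are made up for illustration.
SHRED_TYPES = {"event.ts": int, "event.tag": str}

def shred_path(value, expected_type):
    """Route a value into the typed column when it matches the expected
    type; otherwise fall back to the untyped (variant-encoded) column.
    Simplified: real untyped_value would hold a binary variant encoding."""
    if isinstance(value, expected_type):
        return {"typed_value": value, "untyped_value": None}
    # Mismatched type: keep it as an opaque untyped residual.
    return {"typed_value": None, "untyped_value": value}

# A value of the common type shreds cleanly; a mismatched one falls back.
print(shred_path(1718000000, SHRED_TYPES["event.ts"]))
# {'typed_value': 1718000000, 'untyped_value': None}
print(shred_path("late", SHRED_TYPES["event.ts"]))
# {'typed_value': None, 'untyped_value': 'late'}
```

This is the behavior the single-typed-value design assumes: one common type per path benefits from shredding, and the occasional mismatch lands in `untyped_value` rather than requiring a union of typed columns.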
In the RFC for Delta [here](https://github.com/delta-io/delta/blob/master/protocol_rfcs/variant-type.md#variant-data-in-parquet), it mentions that `struct fields which start with _ (underscore) can be safely ignored`. I think we could add that to the spec here, and perhaps reserve `_metadata_key_paths` as a keyword for a future addition to the spec. As long as readers ignore fields starting with an underscore, it shouldn't cause any backwards compatibility issues.

> union of the types observed at each path

I'd be interested in understanding the expected use cases. Distinguishing different types for scalar fields does not seem to add much value compared to storing mismatched types in an `untyped_value` column, and it adds complexity to the spec and implementation. Could you highlight an example or query pattern where having different typed values would provide significant benefits over a single `typed_value`? Our assumption is that one of the types will be most common, and shredding should focus on that one.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
