cashmand commented on PR #46831: URL: https://github.com/apache/spark/pull/46831#issuecomment-2155445121
Hi @shaeqahmed, thanks for your detailed response. Your suggestions add a lot of flexibility to the shredding scheme! At the same time, we are wary of adding complexity that could be a burden on implementations to support. Our expectation is that the primary candidate for shredding is data with a fairly uniform sub-structure; in particular, this assumption simplifies both the shredding decision and the resulting behavior. With a more flexible shredding scheme, it also becomes more difficult to decide what to shred, since there are significantly more viable options with nuanced tradeoffs. Simplicity in implementation and in user-observed behavior is very important to us.

Since the benefits of shredding are both data- and workload-dependent, could you help us understand concrete query examples for your suggested features? Do you have particular use cases where you expect to write a non-uniform shredding scheme and get a significant performance benefit?

A few specific points:

> In my proposal, the metadata fields are also made optional, which if not present, means that the metadata is encoded in the value.

Do you have specific typed/untyped combinations in mind that are common based on your experience? Adding options to the spec increases implementation complexity (readers need to support both versions to function correctly), and we'd like to explore the impact of these choices more concretely. Our motivation for combining the metadata and value is to reduce the size of the Parquet schema. Large schemas can be quite a performance burden due to how Parquet stores its footer, especially for selective queries.

> metadata_key_paths

Do you have a concrete query in mind for this feature? My understanding is that it is redundant, and readers could safely ignore it. We've purposely designed shredding without redundancy to avoid unexpected increases in storage.
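To make the single typed/untyped split concrete, here is a minimal sketch of how a writer might route one shredded path into a `typed_value` or `untyped_value` column pair. The column names follow the shredding discussion in this thread; the path names, the `SHRED_TYPES` shredding decision, and the use of plain Python values instead of a binary variant encoding are all illustrative assumptions, not part of the spec.

```python
# Hypothetical shredding decision: which type each shredded path is expected
# to hold. In the real scheme this would come from the writer's analysis of
# the data; the paths and types here are made up for illustration.
SHRED_TYPES = {"event.ts": int, "event.tag": str}

def shred_path(value, expected_type):
    """Route a value into the typed column when it matches the expected
    type; otherwise fall back to the untyped (variant-encoded) column.
    Simplified: real untyped_value would hold a binary variant encoding."""
    if isinstance(value, expected_type):
        return {"typed_value": value, "untyped_value": None}
    # Mismatched type: keep it as an opaque untyped residual.
    return {"typed_value": None, "untyped_value": value}

# A value of the common type shreds cleanly; a mismatched one falls back.
print(shred_path(1718000000, SHRED_TYPES["event.ts"]))
# {'typed_value': 1718000000, 'untyped_value': None}
print(shred_path("late", SHRED_TYPES["event.ts"]))
# {'typed_value': None, 'untyped_value': 'late'}
```

This is the behavior the single-typed-value design assumes: one common type per path benefits from shredding, and the occasional mismatch lands in `untyped_value` rather than requiring a union of typed columns.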
In the RFC for Delta [here](https://github.com/delta-io/delta/blob/master/protocol_rfcs/variant-type.md#variant-data-in-parquet), it mentions that `struct fields which start with _ (underscore) can be safely ignored`. I think we could add that to the spec here, and perhaps reserve `_metadata_key_paths` as a keyword for a future addition to the spec. As long as readers ignore fields starting with an underscore, it shouldn't cause any backwards compatibility issues.

> union of the types observed at each path

I'd be interested in understanding the expected use cases. Distinguishing different types for scalar fields does not seem to add much value compared to storing mismatched types in an `untyped_value` column, and it adds complexity to the spec and implementation. Could you highlight an example or query pattern where having different typed values would provide significant benefits over a single `typed_value`? Our assumption is that one of the types will be most common, and shredding should focus on that one.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
