cashmand commented on PR #46831:
URL: https://github.com/apache/spark/pull/46831#issuecomment-2155445121

   Hi @shaeqahmed, thanks for your detailed response. Your suggestions add a 
lot of flexibility to the shredding scheme! At the same time, we are wary of 
adding complexity that would burden implementations. Our expectation is that 
the primary candidate for shredding is data with a fairly uniform 
sub-structure; in particular, this assumption simplifies both the shredding 
decision and the resulting behavior. A more flexible shredding scheme also 
makes it harder to decide what to shred, since there are significantly more 
viable options with nuanced tradeoffs. Simplicity, both in implementation and 
in user-observed behavior, is very important to us.
   
   Since the benefits of shredding are both data- and workload-dependent, 
could you share concrete query examples for your suggested features? Do you 
have particular use cases where you expect to write a non-uniform shredding 
scheme and get a significant performance benefit?
   
   A few specific points:
   
   > In my proposal, the metadata fields are also made optional, which if not 
present, means that the metadata is encoded in the value.
   
   Do you have specific typed/untyped combinations in mind that are common 
based on your experience? Adding options to the spec increases the 
implementation complexity (readers need to support both versions to function 
correctly), and we’d like to explore the impact of these choices more 
concretely.
   Our motivation for combining the metadata and value is to reduce the size of 
the Parquet schema. Large schemas can be quite a performance burden due to how 
Parquet stores its footer, especially for selective queries.
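   To make the schema-size point concrete, here is a rough sketch (Python as 
illustration; the field names and the dict-tree schema model are hypothetical, 
and leaf-column count is only a proxy for footer size). Keeping a separate 
metadata column next to each shredded value roughly doubles the number of leaf 
columns compared to encoding the metadata into the value:

```python
# Sketch: leaf-column counts for two hypothetical shredded Variant layouts.
# Field names and the dict-tree schema model are illustrative, not from the spec.

def count_leaves(schema):
    """Count leaf (primitive) columns in a nested schema given as a dict tree."""
    if not isinstance(schema, dict):
        return 1  # a primitive leaf column
    return sum(count_leaves(child) for child in schema.values())

shredded_fields = ["a", "b", "c"]

# Layout 1: a metadata column kept alongside each shredded value.
separate = {f: {"metadata": "binary", "value": "binary"} for f in shredded_fields}

# Layout 2: metadata encoded inside the value column (the approach we took).
combined = {f: {"value": "binary"} for f in shredded_fields}

print(count_leaves(separate))  # 6 leaf columns
print(count_leaves(combined))  # 3 leaf columns
```

   Every leaf column adds per-column statistics and chunk metadata to the 
Parquet footer, which is why the difference matters most for selective queries 
that otherwise read very little data.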
   
   > metadata_key_paths
   
   Do you have a concrete query in mind for this feature? My understanding is 
that it is redundant, and readers could safely ignore it. We’ve purposely 
designed shredding without redundancy to avoid unexpected increases in storage. 
In the RFC for Delta 
[here](https://github.com/delta-io/delta/blob/master/protocol_rfcs/variant-type.md#variant-data-in-parquet),
 it mentions `struct fields which start with _ (underscore) can be safely 
ignored.` I think we could add that to the spec here, and perhaps reserve 
`_metadata_key_paths` as a keyword for a future addition to the spec. As long as 
readers ignore fields with underscore, it shouldn’t cause any backwards 
compatibility issues.
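   As a minimal sketch of that reader behavior (modeling the shredded struct 
as a plain dict; the field names are hypothetical):

```python
def readable_fields(struct_fields):
    """Drop struct fields whose names start with '_' (underscore); per the
    Delta RFC wording, readers can safely ignore such fields."""
    return {name: col for name, col in struct_fields.items()
            if not name.startswith("_")}

# Hypothetical shredded struct: a reserved underscore field is simply skipped.
cols = {"value": b"...", "typed_value": 42,
        "_metadata_key_paths": ["a", "b"]}
print(sorted(readable_fields(cols)))  # ['typed_value', 'value']
```

   Because older readers never see the reserved field, a future spec addition 
can define it without breaking them.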
   
   > union of the types observed at each path
   
   I’d be interested in understanding the expected use cases. Distinguishing 
different types for scalar fields does not seem to add much value compared to 
storing mismatched types in an untyped_value column, and adds complexity to the 
spec and implementation. Could you highlight an example or query pattern where 
having multiple typed_value columns would provide significant benefits over a 
single typed_value? Our assumption is that one of the types will be most 
common, and shredding should focus on that one.
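   To illustrate the model we have in mind (a simplified sketch: rows that 
miss the chosen type are kept here as raw Python objects rather than a real 
Variant binary encoding):

```python
from collections import Counter

def shred(values):
    """Shred a column into a single typed_value column for the most common
    Python type, spilling all other rows into untyped_value. None marks
    which side holds each row."""
    common_type, _ = Counter(type(v) for v in values).most_common(1)[0]
    typed, untyped = [], []
    for v in values:
        if type(v) is common_type:
            typed.append(v); untyped.append(None)
        else:
            typed.append(None); untyped.append(v)
    return typed, untyped

def unshred(typed, untyped):
    """Reassemble the original column from the two shredded columns."""
    return [t if u is None else u for t, u in zip(typed, untyped)]

vals = [1, 2, "three", 4, 5.0]
typed, untyped = shred(vals)
assert unshred(typed, untyped) == vals
print(typed)    # [1, 2, None, 4, None]
print(untyped)  # [None, None, 'three', None, 5.0]
```

   With this model, the occasional mismatched row costs only a fallback read 
from untyped_value, while queries over the dominant type still get the full 
benefit of the typed column.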


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

