cashmand commented on PR #46831: URL: https://github.com/apache/spark/pull/46831#issuecomment-2168665661
Hi @shaeqahmed, sorry for the delay, and for not replying earlier about how nested structs are handled. I’ll try to update the doc with an example, but in the meantime, the plan is to support two of the cases you described:

- Adding the struct directly as a nested key path within the existing `paths` structure is meant to be the primary approach. The example at the end of the doc shows an array-of-struct with this form, but a struct-of-struct would look the same. At any nesting level, if a given key doesn’t exist in the parquet schema, it would be stored in the top-level `value` binary. A request for any non-leaf field would require checking the top-level `value`, and merging the result with the shredded values (as described in the pseudo-code in the PR).
- Adding a nested key path as a nested Variant is supported. This is indicated by including only `untyped_value`, with no corresponding `typed_value`. But in this case, it wouldn’t be possible to recursively shred the nested value.

Please let me know if the above is clear, or if I’m misunderstanding the question.

Thanks for describing your use case and the papers you’ve referenced. The CloudTrail use case makes a lot of sense, and is definitely one that we should consider carefully. For the current approach, I think it would make sense to shred a field like `requestParameters` as a Variant binary. This would provide a lot of the benefit, since queries on `requestParameters` would not need to fetch the top-level binary or any other columns. I can see that the more flexible schema you’ve proposed could provide better performance for some query patterns, though. At the same time, we’d like to minimize the complexity in the spec, the Parquet footer, and the implementation. I’d like to spend a bit more time looking at the papers you’ve linked to, and considering the trade-offs between the proposals.
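To make the two shredding cases described above more concrete, here is a rough Python sketch. It is not the encoding from the design doc or the PR: plain dicts stand in for the Variant binary and the Parquet columns, and the names `shred`, `reconstruct`, `read`, and the `VARIANT` marker are hypothetical helpers invented for illustration. Only `paths`, `value`, `typed_value`, and `untyped_value` come from the discussion; the sketch's `residual` dict plays the role of the top-level `value`, and the handling of type mismatches is an assumption.

```python
VARIANT = "variant"  # hypothetical marker: keep this key as an unshredded nested Variant


def shred(obj, schema):
    """Split obj into (paths, residual); residual models the top-level `value`."""
    paths, residual = {}, {}
    for key, val in obj.items():
        spec = schema.get(key)
        if spec is None:
            residual[key] = val                   # key not in the shredding schema
        elif spec == VARIANT:
            paths[key] = {"untyped_value": val}   # nested Variant, not shredded further
        elif isinstance(spec, dict):
            if isinstance(val, dict):
                sub_paths, sub_residual = shred(val, spec)
                paths[key] = {"paths": sub_paths}
                if sub_residual:                  # unknown nested keys go to the top level
                    residual[key] = sub_residual
            else:
                residual[key] = val               # assumption: non-object falls back to residual
        elif type(val) is spec:
            paths[key] = {"typed_value": val}
        else:
            paths[key] = {"untyped_value": val}   # type mismatch keeps the raw value
    return paths, residual


def reconstruct(paths, residual):
    """Merge shredded fields back with the residual to rebuild the full object."""
    result = dict(residual)
    for key, field in paths.items():
        if "paths" in field:
            result[key] = reconstruct(field["paths"], residual.get(key, {}))
        elif "typed_value" in field:
            result[key] = field["typed_value"]
        else:
            result[key] = field["untyped_value"]
    return result


def read(paths, residual, path):
    """Return (value, used_residual) for a key path given as a list of keys."""
    key, rest = path[0], path[1:]
    field = paths.get(key)
    if field is None:                             # not shredded: only the residual has it
        node = residual[key]
        for k in rest:
            node = node[k]
        return node, True
    if "paths" in field:
        sub_residual = residual.get(key, {})
        if rest:
            return read(field["paths"], sub_residual, rest)
        # Non-leaf request: the residual must be checked and merged.
        return reconstruct(field["paths"], sub_residual), True
    value = field["typed_value"] if "typed_value" in field else field["untyped_value"]
    for k in rest:                                # descend inside a nested Variant
        value = value[k]
    return value, False


event = {
    "eventName": "PutObject",
    "requestParameters": {"bucketName": "b", "maxKeys": 10, "extra": True},
    "sourceIP": "1.2.3.4",
}
nested_schema = {"eventName": str, "requestParameters": {"bucketName": str, "maxKeys": int}}
variant_schema = {"requestParameters": VARIANT}

paths, residual = shred(event, nested_schema)
v_paths, v_residual = shred(event, variant_schema)
```

In this model, reading the leaf `["requestParameters", "bucketName"]` under `nested_schema` never touches the residual, while reading all of `requestParameters` does, because the unshredded `extra` key lives there. Under `variant_schema`, any read below `requestParameters` is served from its single `untyped_value`, which mirrors the point that such queries would not need the top-level binary.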
Can you give us a better idea of what type of queries you expect to see on the read path, and how your scheme would benefit? E.g. would you expect to typically see a mix of queries that need all of `requestParameters`, and others that only need a field or two? What type of query is likely to benefit significantly from shredding different types (e.g. integer and string) vs. just shredding the most common type, and fetching the rest from the binary? We would like to better understand how the shredding scheme will improve read performance for your workload. Thanks!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
