huaxingao commented on PR #52578: URL: https://github.com/apache/spark/pull/52578#issuecomment-3398211655
Thanks @viirya for the PR! Appreciate you putting together an alternative. A couple of thoughts before we move forward: 1. Variant + column pruning are not special-cased semantics - Variant is just another column in the projected schema from Spark’s perspective. When a source supports column pruning, Spark should prune all columns it doesn’t need, including Variant, exactly like it does for other types. I don't think we need to introduce a new interface to control if Variant should be pruned. - If a particular DSv2 implementation does not support column pruning, then pushing Variant access down into that source is not meaningful; the primary benefit of Variant pushdown is to reduce the read surface via pruning. 2. On testing built-in Parquet vs. InMemoryTable - The InMemoryTable tests were designed to validate the planner and API plumbing, to test that Spark forms the right pruned schema and pushdown contracts and the source receives them. It should be sufficient for testing. - It's nice to add a smoke test with the built-in DSv2 Parquet to increase end-to-end confidence and catch regressions in common deployments. But I don't think it should be used as a judgement that a PR is not good. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
