viirya commented on PR #52578:
URL: https://github.com/apache/spark/pull/52578#issuecomment-3398431922

   > Thanks @viirya for the PR! Appreciate you putting together an alternative. 
A couple of thoughts before we move forward:
   > 
   >     1. Variant + column pruning are not special-cased semantics
   > 
   > 
   >     * Variant is just another column in the projected schema from Spark’s 
perspective. When a source supports column pruning, Spark should prune all 
columns it doesn’t need, including Variant, exactly like it does for other 
types. I don't think we need to introduce a new interface to control if Variant 
should be pruned.
   > 
   >     * If a particular DSv2 implementation does not support column pruning, 
then pushing Variant access down into that source is not meaningful; the 
primary benefit of Variant pushdown is to reduce the read surface via pruning.
   
   Column pruning is not the same as this variant scan pushdown. This feature 
basically changes the datatype (variant type to struct type) not just pruning 
fields from it. Mixing column pruning with variant scan pushdown causes 
confusion on API semantics and unexpected errors on DSv2 datasource 
implementation. Actually you can access all fields from a variant column but 
this variant scan pushdown still works effectively. It is not about reducing 
read surface but optimize variant access and usage in the read and the query 
plan.
   
   No to mention that with current approach if it pushes a variant scan into a 
DSv2 implementation doesn't know about it, there will be unexpected errors.
   
   >     2. On testing built-in Parquet vs. InMemoryTable
   > 
   > 
   >     * The InMemoryTable tests were designed to validate the planner and 
API plumbing, to test that Spark forms the right pruned schema and pushdown 
contracts and the source receives them. It should be sufficient for testing.
   > 
   >     * It's nice to add a smoke test with the built-in DSv2 Parquet to 
increase end-to-end confidence and catch regressions in common deployments. But 
I don't think it should be used as a judgement that a PR is not good.
   
   No. The test just verifies the schema was changed. InMemoryTable/Scan 
doesn't actually do the variant pushdown. In other words, it just makes sure 
that you changed the variant type to struct type in the InMemoryTable. It 
doesn't actually test if the variant scan happens correctly or not, isn't? If 
the table/scan doesn't support it, what will happen? In other words, for 
example, we can simply change a variant type to any other type and keep the 
test passing because the schema is matched. Btw, there are actually issues on 
the test, but that is not the only issue going to revert the PR.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to