aokolnychyi commented on PR #53276:
URL: https://github.com/apache/spark/pull/53276#issuecomment-3614276778
> Currently VariantAccessInfo represents an access to a variant column. So
it has a member String columnName. What does a VariantExtraction represent for?
> Although you said "each variant_get expression as separate
VariantExtraction", if there are multiple variant_gets for same variant column,
you mean to have multiple VariantExtractions? Currently they are all
represented by one VariantAccessInfo for the variant column, I think it makes
more sense.
I expect each `variant_get` and `try_variant_get` to be converted into a
`VariantExtraction` with the variant column name parts and the extraction JSON
path. If a connector has shredded 2 out of 3 requested columns, it can simply
mark with booleans which extractions it supports and which must be done in
Spark. If we used `VariantAccessInfo`, connectors would have to create a new
StructType, which seems very complicated and error-prone.
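To make the idea concrete, here is a minimal sketch of what such a `VariantExtraction` could look like. This is an assumption for illustration only, not an existing Spark class: the name parts, JSON path field, and the boolean pushdown flag are all hypothetical, chosen to show how a connector could mark per-extraction support without building a new StructType.

```java
// Hypothetical sketch: this class does not exist in Spark in this form.
// It models one variant_get / try_variant_get call on a variant column.
public final class VariantExtraction {
    private final String[] columnNameParts; // variant column, e.g. {"event", "payload"}
    private final String jsonPath;          // extraction path, e.g. "$.user.id"
    private boolean pushedDown = false;     // set by the connector, read back by Spark

    public VariantExtraction(String[] columnNameParts, String jsonPath) {
        this.columnNameParts = columnNameParts;
        this.jsonPath = jsonPath;
    }

    public String[] columnNameParts() { return columnNameParts; }
    public String jsonPath() { return jsonPath; }

    // A connector that shredded this path marks it as pushed down;
    // any extraction left unmarked is still evaluated by Spark.
    public void markPushedDown() { this.pushedDown = true; }
    public boolean isPushedDown() { return pushedDown; }
}
```

With this shape, a connector that shredded 2 of 3 requested paths would call `markPushedDown()` on those two and leave the third for Spark, with no schema manipulation involved.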
> I think connectors still can read and parse the variant to required type
even it is not a shredded variant. From the view of Spark and DSv2 API, we
don't need to know how the connectors fulfill the pushdown requirement.
I feel this is VERY dangerous. I read through the casting logic in Spark, and
it has so many edge cases that there is no way connectors can replicate this
behavior consistently. We don't want inconsistent shredding behavior across
connectors. In the future, we may add a Spark-provided casting function to
`VariantExtraction`. That said, I would not do it now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]