aokolnychyi commented on PR #53276:
URL: https://github.com/apache/spark/pull/53276#issuecomment-3614192738
I personally would want the following changes to VARIANT pushdown in DSv2:
1. Move the logic to ScanBuilder instead of Scan (this PR attempts to do
exactly that).
2. Evolve the connector API.
- Rename interfaces to `SupportsPushDownVariantExtractions` /
`VariantExtraction` (alternatives are welcome).
- Pass each `variant_get` expression as a separate `VariantExtraction` so that
connectors can check each field easily.
- Return `boolean[]` from `pushVariantExtractions` to indicate what was
pushed.
```
import org.apache.spark.sql.connector.read.ScanBuilder;
import org.apache.spark.sql.types.DataType;

interface SupportsPushDownVariantExtractions extends ScanBuilder {
  // indicates which of the given extractions were pushed
  boolean[] pushVariantExtractions(VariantExtraction[] extractions);
}

interface VariantExtraction {
  String[] columnName();        // variant column name
  String path();                // extraction path from variant_get and try_variant_get
  DataType expectedDataType();  // expected data type
}
```
3. Clearly state that connectors must only push down an extraction if the
data has already been shredded. Connectors should not attempt to cast or
extract on demand; if the data wasn't shredded prior to the scan, the
extraction must be done in Spark.
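One way to read the `boolean[]` contract from the caller's side is sketched below: whatever the connector does not report as pushed stays as a `variant_get` / `try_variant_get` evaluated in Spark. The class `PushdownResultSketch` and its method are hypothetical stand-ins for illustration, not Spark planner code, and the assumption that flags line up one-to-one with the requested extractions is mine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Spark planner code.
final class PushdownResultSketch {

  // Returns the extractions the connector did NOT push; under the rule above,
  // these remain variant_get / try_variant_get expressions evaluated in Spark.
  // Assumes pushedFlags[i] corresponds to requested.get(i).
  static <T> List<T> evaluatedInSpark(List<T> requested, boolean[] pushedFlags) {
    if (pushedFlags.length != requested.size()) {
      throw new IllegalArgumentException("expected one flag per requested extraction");
    }
    List<T> remaining = new ArrayList<>();
    for (int i = 0; i < pushedFlags.length; i++) {
      if (!pushedFlags[i]) {
        remaining.add(requested.get(i));
      }
    }
    return remaining;
  }
}
```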
Basically consider the following example:
```
SELECT
variant_get(v, '$.data[1].a', 'string'),
variant_get(v, '$.key', 'int'),
variant_get(s.v2, '$.x', 'double')
FROM tbl;
```
This should pass the following extractions to the connector:
```
VariantExtraction[] extractions = [
new VariantExtraction(["v"], "$.data[1].a", StringType),
new VariantExtraction(["v"], "$.key", IntegerType),
new VariantExtraction(["s", "v2"], "$.x", DoubleType) // ← nested in struct 's'
];
```
Connectors mark a `VariantExtraction` as pushed ONLY if they can guarantee
that ALL records satisfy the expected type, meaning the data has been shredded
prior to the scan.
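A connector-side sketch of this rule, assuming the connector has per-file metadata that lists a path only when all rows in that file were shredded to the recorded type. Every name here is hypothetical: `ShreddedFile`, `ShreddingAwareBuilder`, the `"column:path"` key format, the string type names, and the simplified `VariantExtraction` class standing in for the proposed interface. Real code would implement `ScanBuilder` and use Spark's `DataType`.

```java
import java.util.List;
import java.util.Map;

// Simplified stand-in for the proposed VariantExtraction; the type is a plain
// string here instead of Spark's DataType to keep the sketch self-contained.
final class VariantExtraction {
  final String[] columnName;
  final String path;
  final String expectedType;

  VariantExtraction(String[] columnName, String path, String expectedType) {
    this.columnName = columnName;
    this.path = path;
    this.expectedType = expectedType;
  }

  // Lookup key such as "s.v2:$.x" (assumed format for this sketch only).
  String key() {
    return String.join(".", columnName) + ":" + path;
  }
}

// Hypothetical per-file metadata: shredded path key -> shredded type name.
final class ShreddedFile {
  final Map<String, String> shreddedTypes;

  ShreddedFile(Map<String, String> shreddedTypes) {
    this.shreddedTypes = shreddedTypes;
  }
}

final class ShreddingAwareBuilder {
  private final List<ShreddedFile> files;

  ShreddingAwareBuilder(List<ShreddedFile> files) {
    this.files = files;
  }

  // Marks an extraction as pushed only if EVERY file to be scanned has the
  // path shredded with exactly the expected type. With no files, or with any
  // file missing the path, nothing is pushed and Spark evaluates it instead.
  boolean[] pushVariantExtractions(VariantExtraction[] extractions) {
    boolean[] pushed = new boolean[extractions.length];
    for (int i = 0; i < extractions.length; i++) {
      VariantExtraction e = extractions[i];
      pushed[i] = !files.isEmpty() && files.stream().allMatch(
          f -> e.expectedType.equals(f.shreddedTypes.get(e.key())));
    }
    return pushed;
  }
}
```

For the query above, if every file shredded `v:$.key` as `int` but only some files shredded `s.v2:$.x`, this sketch would report only the second extraction as pushed. It never casts or extracts on demand; anything not already shredded everywhere is left to Spark.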