aokolnychyi commented on PR #53276:
URL: https://github.com/apache/spark/pull/53276#issuecomment-3614192738

   I personally would want the following changes to VARIANT pushdown in DSv2:
   
   1. Move the logic to ScanBuilder instead of Scan (this PR attempts to do 
exactly that).
   2. Evolve the connector API.
   
   - Rename interfaces to `SupportsPushDownVariantExtractions` / 
`VariantExtraction` (alternatives are welcome).
   - Pass each `variant_get` expression as a separate `VariantExtraction` so 
that connectors can check each field easily.
   - Return `boolean[]` from `pushVariantExtractions` to indicate what was 
pushed.
   
   ```
   interface SupportsPushDownVariantExtractions extends ScanBuilder {
     boolean[] pushVariantExtractions(VariantExtraction[] extractions);
   }
   
   interface VariantExtraction {
     String[] columnName(); // variant column name
     String path(); // extraction path from variant_get and try_variant_get
     DataType expectedDataType(); // expected data type
   }
   ```
   
   3. Clearly state that connectors must only push down an extraction if the 
data has already been shredded. Connectors should not attempt to cast / 
extract on demand. If the data hasn't been shredded prior to the scan, the 
extraction must be done in Spark.
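
To make point 3 concrete, here is a minimal sketch of what a connector-side implementation might look like. Everything in it is a stand-in: the record and interface only mirror the proposed shapes (they are not existing Spark APIs), the expected type is a plain `String` instead of `DataType`, and `ExampleScanBuilder` with its set of shredded keys is a hypothetical way for a connector to know what was shredded at write time.

```java
import java.util.Arrays;
import java.util.Set;

public class ShreddedOnlyPushdown {
  // Hypothetical stand-in for the proposed VariantExtraction.
  record VariantExtraction(String[] columnName, String path, String expectedDataType) {}

  // Hypothetical stand-in for the proposed mix-in (without the ScanBuilder parent).
  interface SupportsPushDownVariantExtractions {
    boolean[] pushVariantExtractions(VariantExtraction[] extractions);
  }

  // Assumed connector that tracks which (column, path, type) triples were shredded.
  static class ExampleScanBuilder implements SupportsPushDownVariantExtractions {
    private final Set<String> shredded;

    ExampleScanBuilder(Set<String> shredded) {
      this.shredded = shredded;
    }

    static String key(VariantExtraction e) {
      return String.join(".", e.columnName()) + "|" + e.path() + "|" + e.expectedDataType();
    }

    @Override
    public boolean[] pushVariantExtractions(VariantExtraction[] extractions) {
      boolean[] pushed = new boolean[extractions.length];
      for (int i = 0; i < extractions.length; i++) {
        // Push only what is already shredded with the expected type;
        // anything else is left for Spark, never cast on demand.
        pushed[i] = shredded.contains(key(extractions[i]));
      }
      return pushed;
    }
  }

  public static void main(String[] args) {
    ExampleScanBuilder builder = new ExampleScanBuilder(Set.of("v|$.key|int"));
    boolean[] pushed = builder.pushVariantExtractions(new VariantExtraction[] {
      new VariantExtraction(new String[] {"v"}, "$.data[1].a", "string"),
      new VariantExtraction(new String[] {"v"}, "$.key", "int")
    });
    System.out.println(Arrays.toString(pushed)); // prints [false, true]
  }
}
```

The point of the sketch is that the decision is a pure metadata lookup: the builder never inspects or converts values, it only reports what is already shredded.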
   
   Basically consider the following example:
   
   ```
     SELECT
       variant_get(v, '$.data[1].a', 'string'),
       variant_get(v, '$.key', 'int'),
       variant_get(s.v2, '$.x', 'double')
     FROM tbl;
   ```
   
   This should pass the following extractions to the connector:
   
   ```
     VariantExtraction[] extractions = [
       new VariantExtraction(["v"], "$.data[1].a", StringType),
       new VariantExtraction(["v"], "$.key", IntegerType),
       // nested in struct 's'
       new VariantExtraction(["s", "v2"], "$.x", DoubleType)
     ];
   ```
   
   Connectors should mark a `VariantExtraction` as pushed ONLY if they can 
guarantee that ALL records satisfy the expected type, meaning the data has been 
shredded prior to the scan.
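
For illustration, the caller side could consume the returned `boolean[]` roughly as below. This is only a sketch under assumptions: the `VariantExtraction` record is a local stand-in with a `String` type name, `notPushed` is a hypothetical helper, and the connector answer `{false, true, true}` is made up for the example query above.

```java
import java.util.ArrayList;
import java.util.List;

public class ResidualExtractions {
  // Hypothetical stand-in for the proposed VariantExtraction.
  record VariantExtraction(String[] columnName, String path, String expectedDataType) {}

  // Returns the extractions the connector did not push; in this sketch
  // those are the ones Spark would still have to evaluate itself.
  static List<VariantExtraction> notPushed(VariantExtraction[] extractions, boolean[] pushed) {
    if (extractions.length != pushed.length) {
      throw new IllegalArgumentException("expected one flag per extraction");
    }
    List<VariantExtraction> residual = new ArrayList<>();
    for (int i = 0; i < extractions.length; i++) {
      if (!pushed[i]) {
        residual.add(extractions[i]);
      }
    }
    return residual;
  }

  public static void main(String[] args) {
    VariantExtraction[] extractions = {
      new VariantExtraction(new String[] {"v"}, "$.data[1].a", "string"),
      new VariantExtraction(new String[] {"v"}, "$.key", "int"),
      new VariantExtraction(new String[] {"s", "v2"}, "$.x", "double")
    };
    boolean[] pushed = {false, true, true}; // assumed connector answer
    for (VariantExtraction e : notPushed(extractions, pushed)) {
      // prints: v $.data[1].a
      System.out.println(String.join(".", e.columnName()) + " " + e.path());
    }
  }
}
```

With one flag per extraction, a partially shredded table degrades gracefully: only the unshredded paths fall back to Spark.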
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

