aokolnychyi commented on code in PR #38004:
URL: https://github.com/apache/spark/pull/38004#discussion_r982723941
##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/LogicalWriteInfo.java:
##########
@@ -45,4 +45,18 @@ public interface LogicalWriteInfo {
* the schema of the input data from Spark to data source.
*/
StructType schema();
+
+ /**
+ * the schema of the input metadata from Spark to data source.
+ */
+ default StructType metadataSchema() {
+ return null;
Review Comment:
@amaliujia @cloud-fan, we can do something like this.
```java
/**
 * the schema of the ID columns from Spark to data source.
 */
default Optional<StructType> rowIdSchema() {
  throw new UnsupportedOperationException(
    getClass().getName() + " does not implement rowIdSchema");
}

/**
 * the schema of the input metadata from Spark to data source.
 */
default Optional<StructType> metadataSchema() {
  throw new UnsupportedOperationException(
    getClass().getName() + " does not implement metadataSchema");
}
```
Now the question is what to report in `schema()` for delta-based DELETE
operations, where we do not pass the row itself, only the row ID and metadata.
One option is to report an empty struct, but let me know if you have other
ideas.
The way I was approaching it initially:
```
schema()         -> the row schema for new records (only MERGE adds new records)
rowIdSchema()    -> the schema for the row ID passed to data sources to mark a record as deleted/updated
metadataSchema() -> the schema of projected metadata columns that contain some extra info about the row that is being deleted/updated
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]