[GitHub] [iceberg] flyrain commented on a diff in pull request #4683: Read deleted rows with metadata column IS_DELETED

GitBox Tue, 17 May 2022 22:25:49 -0700


flyrain commented on code in PR #4683:
URL: https://github.com/apache/iceberg/pull/4683#discussion_r875479043



##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -290,8 +295,6 @@ private static Schema fileProjection(Schema tableSchema, Schema requestedSchema,
       requiredIds.addAll(eqDelete.equalityFieldIds());
     }
 
-    requiredIds.add(MetadataColumns.IS_DELETED.fieldId());

Review Comment:
   We project the pos column only if there are pos deletes, as the following
   code shows. That makes sense, since we need it to filter out rows removed
   by pos deletes.
   ```
       if (!posDeletes.isEmpty()) {
         requiredIds.add(MetadataColumns.ROW_POSITION.fieldId());
       }
   ```
   Here is my thought on the IS_DELETED column: it is present only if the
   front end (e.g. a Spark read) asks for it. For example, in the CDC case, we
   put it in a filter like this to read deleted rows. Here is the code from my
   CDC draft PR #4539.
   ```
       Dataset<Row> scanDF = spark().read().format("iceberg")
           .option(SparkReadOptions.FILE_SCAN_TASK_SET_ID, groupID)
           .load(table.name())
           .filter(functions.column(MetadataColumns.IS_DELETED.name()).equalTo(true));
   ```
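
To make the intended rule concrete, here is a minimal, simplified sketch of the projection logic in plain Java. It is not the actual `DeleteFilter.fileProjection` code: the class `ProjectionSketch`, the method `requiredIds`, and both id constants are hypothetical placeholders (the real ids come from `MetadataColumns`). It only illustrates that the position column is added when pos deletes exist, while IS_DELETED shows up only if the caller requested it.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ProjectionSketch {
  // Placeholder ids for illustration only; the real values live in MetadataColumns.
  static final int ROW_POSITION_ID = 1001;
  static final int IS_DELETED_ID = 1002;

  // Computes the field ids to read from a data file (simplified assumption:
  // the result is the requested ids plus whatever the delete files need).
  static Set<Integer> requiredIds(
      Set<Integer> requestedIds, List<Integer> equalityFieldIds, boolean hasPosDeletes) {
    // Start from what the front end asked for; IS_DELETED is in here only if requested.
    Set<Integer> required = new LinkedHashSet<>(requestedIds);

    // Equality deletes need their equality columns to match rows.
    required.addAll(equalityFieldIds);

    // Position deletes need the row position, so add it only when they exist.
    if (hasPosDeletes) {
      required.add(ROW_POSITION_ID);
    }

    // Note: no unconditional required.add(IS_DELETED_ID) here.
    return required;
  }

  public static void main(String[] args) {
    // Plain read: IS_DELETED is not projected.
    System.out.println(requiredIds(Set.of(1, 2), List.of(2, 3), true));
    // CDC-style read that asks for IS_DELETED explicitly.
    System.out.println(requiredIds(Set.of(1, IS_DELETED_ID), List.of(), false));
  }
}
```

With this shape, a regular scan never pays for the IS_DELETED column, and a CDC-style scan gets it simply because it appears in the requested schema.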



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


