rdblue commented on a change in pull request #887: Define file and position 
based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r401748494
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains 
the file as “live” data is garbage collected. But this is harder to detect and 
requires finding the diff of multiple snapshots. It is easier to track what 
files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be 
applied to the dataset at read time. Deletion files may either specify rows by 
column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
+```
+
+The rows in the deletion file must be sorted by `filename` and `position` so 
as to leverage the merge sort. The layout of sorted records in the deletion 
file looks like:
+```
+file1, 1
+file1, 2
+file1, 5
+file2, 3
+file2, 4
+file2, 7
+file3, 6
+file3, 8
+file3, 9
+```
 
 Review comment:
   I don't think this is necessary. It should be sufficient to say that the 
delete file must be sorted by ascending filename then position.
   
   This is for two reasons:
   1. Sorting by file allows filter pushdown by file in columnar storage 
formats.
   2. Sorting by position allows filtering rows while scanning, to avoid 
keeping deletes in memory.
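   
   To illustrate reason 1, here is a minimal sketch of pruning by file. It assumes the delete file is stored in row groups that expose min/max `filename` statistics, as columnar formats typically do. `RowGroup` and `groups_to_read` are hypothetical names for this illustration, not Iceberg or Parquet APIs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical row group of a delete file, holding sorted (filename, position) records."""
    records: List[Tuple[str, int]]

    @property
    def min_filename(self) -> str:
        # Records are sorted by filename, so the first record carries the minimum.
        return self.records[0][0]

    @property
    def max_filename(self) -> str:
        # Likewise, the last record carries the maximum.
        return self.records[-1][0]


def groups_to_read(groups: List[RowGroup], filename: str) -> List[RowGroup]:
    """Keep only row groups whose [min, max] filename range can contain `filename`."""
    return [g for g in groups if g.min_filename <= filename <= g.max_filename]


# The sorted layout from the proposed spec text, split into groups of three records.
groups = [
    RowGroup([("file1", 1), ("file1", 2), ("file1", 5)]),
    RowGroup([("file2", 3), ("file2", 4), ("file2", 7)]),
    RowGroup([("file3", 6), ("file3", 8), ("file3", 9)]),
]
selected = groups_to_read(groups, "file2")
```

   Because the records are sorted by filename, each group covers a narrow filename range and a reader looking for `file2` can skip the other two groups. With unsorted records, every group's range would likely span all files and nothing could be skipped.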
   
   I think it would help to rephrase "so as to leverage the merge sort" to 
"allow filtering rows while scanning" or something similar. Although we think 
of this as merge-sort, the operation is not a sort that produces a sorted list 
of deletes and rows -- it's a filter.
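
   A minimal sketch of what "filtering rows while scanning" could look like for a single data file. It assumes 0-based row positions and that the deleted positions for that file arrive in ascending order. `filter_deleted` is a hypothetical helper for illustration, not part of Iceberg.

```python
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def filter_deleted(
    rows: Iterable[T], deleted_positions: Iterable[int]
) -> Iterator[Tuple[int, T]]:
    """Yield (position, row) for each row whose position is not deleted.

    Assumes `deleted_positions` is ascending. Only one pending delete is
    held at a time, so the full delete list never has to be in memory.
    """
    deletes = iter(deleted_positions)
    next_delete: Optional[int] = next(deletes, None)
    for position, row in enumerate(rows):
        # Advance past deletes that are already behind the scan (e.g. duplicates).
        while next_delete is not None and next_delete < position:
            next_delete = next(deletes, None)
        if next_delete == position:
            continue  # this row is deleted, drop it
        yield position, row


rows = ["a", "b", "c", "d", "e", "f"]
kept = list(filter_deleted(rows, [1, 2, 5]))
# kept == [(0, "a"), (3, "d"), (4, "e")]
```

   The output contains only surviving rows and no merged list of rows and deletes, which is why this behaves as a filter rather than a sort.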

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
