rdblue commented on a change in pull request #887: Define file and position 
based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r401746023
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains 
the file as “live” data is garbage collected. But this is harder to detect and 
requires finding the diff of multiple snapshots. It is easier to track what 
files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be 
applied to the dataset at read time. Deletion files may either specify rows by 
column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
+```
+
+The rows in the deletion file must be sorted by `filename` and `position` so 
as to leverage the merge sort. The layout of sorted records in the deletion 
file looks like:
+```
+file1, 1
+file1, 2
+file1, 5
+file2, 3
+file2, 4
+file2, 7
+file3, 6
+file3, 8
+file3, 9
+```
+ 
+It is also worth to note that in order to keep module independence, deletion 
files are written with the same file format as the table's file format.
 
 Review comment:
   This is a recommendation, not a requirement, so we should say so explicitly. 
The requirement is that a delete file can be written using any supported 
data file format.
   
   Also, the purpose of the recommendation is not module independence. People 
choose file formats based on what they use for most tables and have experience 
tuning, so it makes sense to use the same format for delete files and data 
files.
   
   It's convenient not to need a dependency on iceberg-orc or iceberg-parquet 
when building a service if all data and delete files are Avro, but we don't 
want people to misinterpret the spec and think there is a guarantee that 
delete file formats match data file formats.
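
To make the sort requirement in the quoted diff concrete, here is a minimal sketch of how a reader might apply a delete file sorted by `filename` and `position` while scanning one data file. The function name `live_positions`, the 0-based position convention, and the in-memory list of tuples are assumptions for illustration only; none of them come from the spec or from an Iceberg API.

```python
from typing import Iterable, Iterator, Tuple

# Example delete records, copied from the sorted layout in the quoted diff.
DELETES = [
    ("file1", 1), ("file1", 2), ("file1", 5),
    ("file2", 3), ("file2", 4), ("file2", 7),
    ("file3", 6), ("file3", 8), ("file3", 9),
]


def live_positions(
    data_file: str, row_count: int, deletes: Iterable[Tuple[str, int]]
) -> Iterator[int]:
    """Yield row positions of data_file that are not marked deleted.

    Assumes deletes is sorted by (filename, position) and that
    positions are 0-based (hypothetical convention for this sketch).
    """
    # Because the input is sorted, this file's positions arrive in
    # ascending order and can be consumed with a single forward cursor.
    pending = (pos for name, pos in deletes if name == data_file)
    next_deleted = next(pending, None)
    for pos in range(row_count):
        # Advance past delete entries already behind the scan cursor
        # (this also tolerates duplicate entries).
        while next_deleted is not None and next_deleted < pos:
            next_deleted = next(pending, None)
        if next_deleted == pos:
            continue  # row is deleted, skip it
        yield pos
```

With the example records, scanning an 8-row `file2` would skip positions 3, 4, and 7. The merge needs no lookup structure because both the scan and the delete entries advance in the same order, which is the benefit the sort requirement is meant to provide.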

