This is an automated email from the ASF dual-hosted git repository.
szehon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/main by this push:
new 57fb6d5658 Doc: Update rewrite data files spark procedure (#11396)
57fb6d5658 is described below
commit 57fb6d56588ea91e995663b1a6f8bfee34060fa8
Author: Hongyue/Steve Zhang <[email protected]>
AuthorDate: Thu Oct 31 13:31:14 2024 -0700
Doc: Update rewrite data files spark procedure (#11396)
---
docs/docs/spark-procedures.md | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/docs/docs/spark-procedures.md b/docs/docs/spark-procedures.md
index 0953e729a7..40407b8c0f 100644
--- a/docs/docs/spark-procedures.md
+++ b/docs/docs/spark-procedures.md
@@ -393,6 +393,7 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to
be simultaneously rewritten |
| `partial-progress.enabled` | false | Enable committing groups of files prior
to the entire rewrite completing |
| `partial-progress.max-commits` | 10 | Maximum number of commits that this
rewrite is allowed to produce if partial progress is enabled |
+| `partial-progress.max-failed-commits` | value of
`partial-progress.max-commits` | Maximum number of failed commits allowed
before job failure, if partial progress is enabled |
| `use-starting-sequence-number` | true | Use the sequence number of the
snapshot at compaction start time instead of that of the newly produced
snapshot |
| `rewrite-job-order` | none | Force the rewrite job order based on the value.
<ul><li>If rewrite-job-order=bytes-asc, then rewrite the smallest job groups
first.</li><li>If rewrite-job-order=bytes-desc, then rewrite the largest job
groups first.</li><li>If rewrite-job-order=files-asc, then rewrite the job
groups with the least files first.</li><li>If rewrite-job-order=files-desc,
then rewrite the job groups with the most files first.</li><li>If
rewrite-job-order=none, then rewrite job g [...]
| `target-file-size-bytes` | 536870912 (512 MB, default value of
`write.target-file-size-bytes` from [table
properties](configuration.md#write-properties)) | Target output file size |
@@ -402,7 +403,13 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `rewrite-all` | false | Force rewriting of all provided files overriding
other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data
that should be rewritten in a single file group. The entire rewrite operation
is broken down into pieces based on partitioning and within partitions based on
size into file-groups. This helps with breaking down the rewriting of very
large partitions which may not be rewritable otherwise due to the resource
constraints of the cluster. |
| `delete-file-threshold` | 2147483647 | Minimum number of deletes that need
to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Identifier of the output
partition spec. Data will be reorganized during the rewrite to align with the
output partitioning. |
+| `remove-dangling-deletes` | false | Remove dangling position and equality
deletes after rewriting. A delete file is considered dangling if it does not
apply to any live data files. Enabling this will generate an additional commit
for the removal. |
+!!! info
+    Dangling delete files are removed based solely on data sequence numbers.
This action does not apply to global
+    equality deletes, to invalid equality deletes whose delete conditions do
not match any data files,
+    or to position delete files containing position deletes that no longer
match any live data files.
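
The two options added here can be passed through the same `options` map as any other setting; a minimal sketch (the spec id value is illustrative and not part of this change; use a partition spec id that exists in your table):

```sql
-- Rewrite to the partition spec with id 1 and clean up dangling deletes
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map('remove-dangling-deletes', 'true', 'output-spec-id', '1')
);
```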
##### Options for sort strategy
@@ -447,9 +454,9 @@ Using the same defaults as bin-pack to determine which files to rewrite.
CALL catalog_name.system.rewrite_data_files(table => 'db.sample', strategy => 'sort', sort_order => 'zorder(c1,c2)');
```
-Rewrite the data files in table `db.sample` using bin-pack strategy in any partition where more than 2 or more files need to be rewritten.
+Rewrite the data files in table `db.sample` using bin-pack strategy in any partition where at least two files need rewriting, and then remove any dangling delete files.
```sql
-CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files','2'));
+CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files', '2', 'remove-dangling-deletes', 'true'));
```
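
For large tables, the partial-progress options listed earlier can be combined in the same map. A sketch using only option names from the table above (the values shown are the documented defaults, written out explicitly):

```sql
-- Commit completed file groups incrementally instead of in one final commit
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map(
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '10',
    'max-concurrent-file-group-rewrites', '5'
  )
);
```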
Rewrite the data files in table `db.sample` and select the files that may contain data matching the filter (id = 3 and name = "foo") to be rewritten.
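
A call of that shape might look as follows, assuming the procedure's `where` argument is used to pass the filter predicate:

```sql
-- Only files that may contain matching rows are selected for rewriting
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  where => 'id = 3 and name = "foo"'
);
```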