This is an automated email from the ASF dual-hosted git repository.
szehon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/main by this push:
new 57fb6d5658 Doc: Update rewrite data files spark procedure (#11396)
57fb6d5658 is described below
commit 57fb6d56588ea91e995663b1a6f8bfee34060fa8
Author: Hongyue/Steve Zhang <[email protected]>
AuthorDate: Thu Oct 31 13:31:14 2024 -0700
Doc: Update rewrite data files spark procedure (#11396)
---
docs/docs/spark-procedures.md | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/docs/docs/spark-procedures.md b/docs/docs/spark-procedures.md
index 0953e729a7..40407b8c0f 100644
--- a/docs/docs/spark-procedures.md
+++ b/docs/docs/spark-procedures.md
@@ -393,6 +393,7 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to
be simultaneously rewritten |
| `partial-progress.enabled` | false | Enable committing groups of files prior
to the entire rewrite completing |
| `partial-progress.max-commits` | 10 | Maximum number of commits that this
rewrite is allowed to produce if partial progress is enabled |
+| `partial-progress.max-failed-commits` | value of
`partial-progress.max-commits` | Maximum number of failed commits allowed
before job failure, if partial progress is enabled |
| `use-starting-sequence-number` | true | Use the sequence number of the
snapshot at compaction start time instead of that of the newly produced
snapshot |
| `rewrite-job-order` | none | Force the rewrite job order based on the value.
<ul><li>If rewrite-job-order=bytes-asc, then rewrite the smallest job groups
first.</li><li>If rewrite-job-order=bytes-desc, then rewrite the largest job
groups first.</li><li>If rewrite-job-order=files-asc, then rewrite the job
groups with the least files first.</li><li>If rewrite-job-order=files-desc,
then rewrite the job groups with the most files first.</li><li>If
rewrite-job-order=none, then rewrite job g [...]
| `target-file-size-bytes` | 536870912 (512 MB, default value of
`write.target-file-size-bytes` from [table
properties](configuration.md#write-properties)) | Target output file size |
@@ -402,7 +403,13 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `rewrite-all` | false | Force rewriting of all provided files overriding
other options |
| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data
that should be rewritten in a single file group. The entire rewrite operation
is broken down into pieces based on partitioning and within partitions based on
size into file-groups. This helps with breaking down the rewriting of very
large partitions which may not be rewritable otherwise due to the resource
constraints of the cluster. |
| `delete-file-threshold` | 2147483647 | Minimum number of deletes that need
to be associated with a data file for it to be considered for rewriting |
+| `output-spec-id` | current partition spec id | Identifier of the output
partition spec. Data will be reorganized during the rewrite to align with the
output partitioning. |
+| `remove-dangling-deletes` | false | Remove dangling position and equality
deletes after rewriting. A delete file is considered dangling if it does not
apply to any live data files. Enabling this will generate an additional commit
for the removal. |
+!!! info
+    Dangling delete files are removed based solely on data sequence numbers.
This action does not apply to global
+    equality deletes, to invalid equality deletes whose delete conditions do
not match any data files,
+    or to position delete files containing position deletes that no longer
match any live data files.
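
The two options added here can be passed through the same `options` map as any other setting; a minimal sketch (the spec id value is illustrative and not part of this change; use a partition spec id that exists in your table):

```sql
-- Rewrite to the partition spec with id 1 and clean up dangling deletes
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map('remove-dangling-deletes', 'true', 'output-spec-id', '1')
);
```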
##### Options for sort strategy
@@ -447,9 +454,9 @@ Using the same defaults as bin-pack to determine which files to rewrite.
CALL catalog_name.system.rewrite_data_files(table => 'db.sample', strategy => 'sort', sort_order => 'zorder(c1,c2)');
```
-Rewrite the data files in table `db.sample` using bin-pack strategy in any partition where more than 2 or more files need to be rewritten.
+Rewrite the data files in table `db.sample` using bin-pack strategy in any partition where at least two files need rewriting, and then remove any dangling delete files.
```sql
-CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files','2'));
+CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files', '2', 'remove-dangling-deletes', 'true'));
```
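
For large tables, the partial-progress options listed earlier can be combined in the same map. A sketch using only option names from the table above (the values shown are the documented defaults, written out explicitly):

```sql
-- Commit completed file groups incrementally instead of in one final commit
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map(
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '10',
    'max-concurrent-file-group-rewrites', '5'
  )
);
```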
Rewrite the data files in table `db.sample` and select the files that may contain data matching the filter (id = 3 and name = "foo") to be rewritten.
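
A call of that shape might look as follows, assuming the procedure's `where` argument is used to pass the filter predicate:

```sql
-- Only files that may contain matching rows are selected for rewriting
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  where => 'id = 3 and name = "foo"'
);
```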