szehon-ho commented on code in PR #7589:
URL: https://github.com/apache/iceberg/pull/7589#discussion_r1199488906
##########
docs/spark-procedures.md:
##########
@@ -364,6 +364,53 @@ Rewrite the manifests in table `db.sample` and disable the
use of Spark caching.
CALL catalog_name.system.rewrite_manifests('db.sample', false)
```
+### `rewrite_position_delete_files`
+
+Iceberg can rewrite position delete files, which serves two purposes:
+* Minor Compaction: Compact small position delete files into larger ones.
This reduces size of metadata stored in manifest files and overhead of opening
small delete files.
+* Remove Dangling Deletes: Filter out position delete records that refer to
data files that are no longer live. After rewrite_data_files, position delete
records pointing to the rewritten data files are not immediately marked for
removeal and remain tracked by the table's live snapshot metadata. This is
known as the 'dangling delete' problem, and is because a single position delete
file can apply to more than one data file, and not all applicable data files
are removed during rewrite.
+
+Iceberg can rewrite position delete files in parallel using Spark with the
`rewritePositionDeletes` action.
+
+#### Usage
+
+| Argument Name | Required? | Type | Description |
+|---------------|-----------|------|-------------|
+| `table` | ✔️ | string | Name of the table to update |
+| `options` | ️ | map<string, string> | Options to be used for actions|
+
+See the [`SizeBasedFileRewriter` Javadoc](../../../javadoc/{{% icebergVersion
%}}/org/apache/iceberg/actions/SizeBasedFileRewriter.html#field.summary),
+for list of all the supported options for this action.
+
+All rewritten position delete files are filtered to remove dangling deletes.
+
+#### Output
+
+| Output Name | Type | Description
|
+|--------------------------------|------|-----------------------------------------------------------------------------|
+| `rewritten_delete_files_count` | int | Number of delete files which were
removed by this command |
+| `rewritten_bytes_count` | long | Count of bytes across delete files
which were removed by this command |
+| `rewritten_delete_files_count` | int | Number of delete files which were
added by this command |
+| `added_delete_files_count` | long | Count of bytes across all new delete
files which were added by this command |
+
+
+#### Examples
+
+Rewrite position delete files in table `db.sample` with default options. This
rewrites position delete files that conform to the default count-per-partition
and size thresholds, and rewrite them using the default target sizes. Dangling
deletes are removed from rewritten delete files.
+```sql
+CALL catalog_name.system.rewrite_position_delete_files('db.sample')
+```
+
+Rewrite all position delete files in table `db.sample`. Dangling deletes are
removed from rewritten delete files.
Review Comment:
Yes. As mentioned in other comment, should probably add docs on all the
options at some point, instead of relying on soft link to javadoc.
##########
docs/spark-procedures.md:
##########
@@ -364,6 +364,53 @@ Rewrite the manifests in table `db.sample` and disable the
use of Spark caching.
CALL catalog_name.system.rewrite_manifests('db.sample', false)
```
+### `rewrite_position_delete_files`
+
+Iceberg can rewrite position delete files, which serves two purposes:
+* Minor Compaction: Compact small position delete files into larger ones.
This reduces size of metadata stored in manifest files and overhead of opening
small delete files.
+* Remove Dangling Deletes: Filter out position delete records that refer to
data files that are no longer live. After rewrite_data_files, position delete
records pointing to the rewritten data files are not immediately marked for
removeal and remain tracked by the table's live snapshot metadata. This is
known as the 'dangling delete' problem, and is because a single position delete
file can apply to more than one data file, and not all applicable data files
are removed during rewrite.
+
+Iceberg can rewrite position delete files in parallel using Spark with the
`rewritePositionDeletes` action.
+
+#### Usage
+
+| Argument Name | Required? | Type | Description |
+|---------------|-----------|------|-------------|
+| `table` | ✔️ | string | Name of the table to update |
+| `options` | ️ | map<string, string> | Options to be used for actions|
+
+See the [`SizeBasedFileRewriter` Javadoc](../../../javadoc/{{% icebergVersion
%}}/org/apache/iceberg/actions/SizeBasedFileRewriter.html#field.summary),
+for list of all the supported options for this action.
+
+All rewritten position delete files are filtered to remove dangling deletes.
Review Comment:
I like the second one, changed
##########
docs/spark-procedures.md:
##########
@@ -364,6 +364,53 @@ Rewrite the manifests in table `db.sample` and disable the
use of Spark caching.
CALL catalog_name.system.rewrite_manifests('db.sample', false)
```
+### `rewrite_position_delete_files`
+
+Iceberg can rewrite position delete files, which serves two purposes:
+* Minor Compaction: Compact small position delete files into larger ones.
This reduces size of metadata stored in manifest files and overhead of opening
small delete files.
+* Remove Dangling Deletes: Filter out position delete records that refer to
data files that are no longer live. After rewrite_data_files, position delete
records pointing to the rewritten data files are not immediately marked for
removeal and remain tracked by the table's live snapshot metadata. This is
known as the 'dangling delete' problem, and is because a single position delete
file can apply to more than one data file, and not all applicable data files
are removed during rewrite.
+
+Iceberg can rewrite position delete files in parallel using Spark with the
`rewritePositionDeletes` action.
+
+#### Usage
+
+| Argument Name | Required? | Type | Description |
+|---------------|-----------|------|-------------|
+| `table` | ✔️ | string | Name of the table to update |
+| `options` | ️ | map<string, string> | Options to be used for actions|
+
+See the [`SizeBasedFileRewriter` Javadoc](../../../javadoc/{{% icebergVersion
%}}/org/apache/iceberg/actions/SizeBasedFileRewriter.html#field.summary),
+for list of all the supported options for this action.
+
+All rewritten position delete files are filtered to remove dangling deletes.
+
+#### Output
+
+| Output Name | Type | Description
|
+|--------------------------------|------|-----------------------------------------------------------------------------|
+| `rewritten_delete_files_count` | int | Number of delete files which were
removed by this command |
+| `rewritten_bytes_count` | long | Count of bytes across delete files
which were removed by this command |
+| `rewritten_delete_files_count` | int | Number of delete files which were
added by this command |
Review Comment:
Thanks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]