RussellSpitzer commented on a change in pull request #3375:
URL: https://github.com/apache/iceberg/pull/3375#discussion_r743777382
##########
File path: site/docs/spark-procedures.md
##########
@@ -240,6 +240,57 @@ Remove any files in the `tablelocation/data` folder which
are not known to the t
CALL catalog_name.system.remove_orphan_files(table => 'db.sample', location =>
'tablelocation/data')
```
+### `rewrite_data_files`
+
+Iceberg tracks each data file in a table. More data files lead to more metadata stored in manifest files, and small data files cause an unnecessary amount of metadata and less efficient queries from file open costs.
+
+Iceberg can compact data files in parallel using Spark with the
`rewriteDataFiles` action. This will combine small files into larger files to
reduce metadata overhead and runtime file open cost.
+
+#### Usage
+
+| Argument Name | Required? | Type | Description |
+|---------------|-----------|------|-------------|
+| `table` | ✔️ | string | Name of the table to update |
+| `strategy` | | string | Name of the strategy - binpack or sort. Defaults to binpack strategy |
+| `sort_order` | | string | Comma separated sort_order_column. Where sort_order_column is a space separated sort order info per column (ColumnName SortDirection NullOrder). <br/> All three members are mandatory to provide for sort_order_column. SortDirection can be ASC or DESC. NullOrder can be NULLS_FIRST or NULLS_LAST |
+| `options` | | map<string, string> | Options to be used for actions |
+| `where` | | string | predicate as a string used for filtering the files. |
+
+
+See the [`RewriteDataFiles` Javadoc](./javadoc/{{ versions.iceberg }}/org/apache/iceberg/actions/RewriteDataFiles.html#field.summary),
+<br/> [`BinPackStrategy` Javadoc](./javadoc/{{ versions.iceberg }}/org/apache/iceberg/actions/BinPackStrategy.html#field.summary)
+and <br/> [`SortStrategy` Javadoc](./javadoc/{{ versions.iceberg }}/org/apache/iceberg/actions/SortStrategy.html#field.summary)
+for a list of all the supported options for this action.
+
+#### Output
+
+| Output Name | Type | Description |
+| ------------|------|-------------|
+| `rewritten_data_files_count` | int | Number of data files which were rewritten by this command |
+| `added_data_files_count` | int | Number of new data files which were written by this command |
+
+#### Examples
+
+Rewrite the data files in table `db.sample` and align data files with table
partitioning.
+```sql
+CALL catalog_name.system.rewrite_data_files('db.sample')
+```
+
+Rewrite the data files in table `db.sample` and use sort strategy with id,
name column as sort order column.
+```sql
+CALL catalog_name.system.rewrite_data_files(table => 'db.sample', strategy => 'sort', sort_order => 'id DESC NULLS_LAST,name ASC NULLS_FIRST')
+```
+
+Rewrite the data files in table `db.sample` and use default binpack strategy
with option of `min-input-files` as 2.
Review comment:
Rewrite the data files in the table using the bin-pack strategy in any
partition where 2 or more files need to be rewritten.
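
For reference, a sketch of the call this description would sit above. The quoted hunk is cut off before the example itself, so this is an illustration rather than the PR's exact text; it assumes the `options` argument is passed as a Spark SQL `map(...)` literal with string keys and values.

```sql
-- Illustrative only: bin-pack is the default strategy, so `strategy` is omitted.
-- 'min-input-files' = '2' is the option named in the description above.
CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files', '2'))
```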