[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3375: CALL procedure for rewrite_data_files

GitBox Wed, 27 Oct 2021 07:37:03 -0700


RussellSpitzer commented on a change in pull request #3375:
URL: https://github.com/apache/iceberg/pull/3375#discussion_r737537547




##########
File path: site/docs/spark-procedures.md
##########
@@ -240,6 +240,34 @@ Remove any files in the `tablelocation/data` folder which are not known to the t
 CALL catalog_name.system.remove_orphan_files(table => 'db.sample', location => 'tablelocation/data')
 ```
 
+### `rewrite_data_files`
+
+Iceberg tracks each data file in a table. More data files lead to more metadata stored in manifest files, and small data files cause an unnecessary amount of metadata and less efficient queries because of file open costs.
+
+Iceberg can compact data files in parallel using Spark with the `rewriteDataFiles` action. This will combine small files into larger files to reduce metadata overhead and runtime file open cost.
+
+#### Usage
+
+| Argument Name | Required? | Type | Description |
+|---------------|-----------|------|-------------|
+| `table`       | ✔️  | string | Name of the table to update |
+| `strategy`    |    | string | Name of the strategy - binpack or sort |

Review comment:
       Probably best to include the default here
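
For context, a call that spells out the strategy might look like the sketch below. It reuses `catalog_name` and `db.sample` from the `remove_orphan_files` example earlier in this file, and the argument names come from the table in this diff. Which strategy applies when `strategy` is omitted is not stated in the diff, which is the gap the comment points at.

```sql
-- Hypothetical example: catalog and table names are placeholders
-- borrowed from the remove_orphan_files example above.
-- `strategy` is optional per the argument table; it is passed
-- explicitly here because the default is not documented in this diff.
CALL catalog_name.system.rewrite_data_files(table => 'db.sample', strategy => 'binpack')
```

If the description column named the default, readers could omit `strategy` from a call like this without checking the source.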




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



