szehon-ho commented on code in PR #8251: URL: https://github.com/apache/iceberg/pull/8251#discussion_r1287702980
##########
docs/spark-procedures.md:
##########

@@ -283,11 +283,27 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
 | `options` | ️ | map<string, string> | Options to be used for actions|
 | `where` | ️ | string | predicate as a string used for filtering the files. Note that all files that may contain data matching the filter will be selected for rewriting|
 
+#### Options
+
+| Name | Default Value | Description |
+|------|---------------|-------------|
+| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |

Review Comment:
   This is great, wondering, can we take the text directly from the javadoc? Looks like that has more information.

##########
docs/spark-procedures.md:
##########

@@ -283,11 +283,27 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
 | `options` | ️ | map<string, string> | Options to be used for actions|
 | `where` | ️ | string | predicate as a string used for filtering the files. Note that all files that may contain data matching the filter will be selected for rewriting|
 
+#### Options
+
+| Name | Default Value | Description |
+|------|---------------|-------------|
+| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
+| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing |
+| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled |
+| `use-starting-sequence-number` | true | Use the sequence number of the snapshot at compaction start time instead of that of the newly produced snapshot |
+| `rewrite-job-order` | none | Force the rewrite job order based on the value (one of bytes-asc, bytes-desc, files-asc, files-desc, none) |
+| `target-file-size-bytes` | default value of `write.target-file-size-bytes` from [table properties](../configuration/#write-properties) | Target output file size |
+| `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria |
+| `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria |
+| `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria |
+| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
+| `max-file-group-size-bytes` | 107374182400 | Largest amount of data that should be rewritten in a single file group |
+| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `compression-factor` | 1.0 | Let user adjust for file size used for estimating actual output data size (used with sort strategy) |

Review Comment:
   Do you think it is easier to split out strategy-specific ones into a separate table?
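For readers following this thread, a minimal sketch of how options from the table above are passed to the `rewrite_data_files` procedure via the `options` map. The catalog, table name, sort column, and option values below are placeholders for illustration, not part of the change under review:

```sql
-- Hypothetical catalog/table and option values, shown only to illustrate the options map
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  strategy => 'sort',
  sort_order => 'id',
  options => map(
    'rewrite-all', 'true',                    -- force rewriting of all selected files
    'target-file-size-bytes', '536870912',    -- 512 MB target output files
    'compression-factor', '1.5'               -- sort-strategy output size estimate adjustment
  )
);
```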
