szehon-ho commented on code in PR #8251:
URL: https://github.com/apache/iceberg/pull/8251#discussion_r1288969923
##########
docs/spark-procedures.md:
##########
@@ -283,11 +283,27 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `options` | ️ | map<string, string> | Options to be used for actions|
| `where` | ️ | string | predicate as a string used for filtering the files. Note that all files that may contain data matching the filter will be selected for rewriting|
+#### Options
+
+| Name | Default Value | Description |
+|------|---------------|-------------|
+| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
+| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing |
+| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled |
+| `use-starting-sequence-number` | true | Use the sequence number of the snapshot at compaction start time instead of that of the newly produced snapshot |
+| `rewrite-job-order` | none | Force the rewrite job order based on the value (one of bytes-asc, bytes-desc, files-asc, files-desc, none) |
+| `target-file-size-bytes` | default value of `write.target-file-size-bytes` from [table properties](../configuration/#write-properties) | Target output file size |
+| `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria |
+| `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria |
+| `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria |
+| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
+| `max-file-group-size-bytes` | 107374182400 | Largest amount of data that should be rewritten in a single file group |
Review Comment:
I think we can also add these original sentences back for context:
```
The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into groups. These sub-units of the rewrite are referred to as file groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster.
```
Maybe we can shorten it like:
```
The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster.
```
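To make the file-group concept concrete in the doc, a short example like the one below could follow the options table. This is only a sketch; the catalog name, table name, and option values are placeholders, not something already in the PR:
```sql
-- Rewrite db.sample, capping each file group at 10 GiB of input data and
-- rewriting at most 4 file groups concurrently (values are illustrative only).
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map(
    'max-file-group-size-bytes', '10737418240',
    'max-concurrent-file-group-rewrites', '4'
  )
);
```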
##########
docs/spark-procedures.md:
##########
@@ -283,11 +283,27 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `options` | ️ | map<string, string> | Options to be used for actions|
| `where` | ️ | string | predicate as a string used for filtering the files. Note that all files that may contain data matching the filter will be selected for rewriting|
+#### Options
+
+| Name | Default Value | Description |
+|------|---------------|-------------|
+| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
+| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing |
+| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled |
+| `use-starting-sequence-number` | true | Use the sequence number of the snapshot at compaction start time instead of that of the newly produced snapshot |
+| `rewrite-job-order` | none | Force the rewrite job order based on the value (one of bytes-asc, bytes-desc, files-asc, files-desc, none) |
Review Comment:
I feel we should at least copy the bullet points over, even though it's a bit long (can the table support bullet points?)
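If it helps to illustrate these options alongside the bullet points, a small sketch like this could be added (the catalog/table names and values are placeholders, not from the PR):
```sql
-- Commit rewritten groups incrementally and process the largest groups first.
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  options => map(
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '10',
    'rewrite-job-order', 'bytes-desc'
  )
);
```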
##########
docs/spark-procedures.md:
##########
@@ -283,11 +283,27 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `options` | ️ | map<string, string> | Options to be used for actions|
| `where` | ️ | string | predicate as a string used for filtering the files. Note that all files that may contain data matching the filter will be selected for rewriting|
+#### Options
+
+| Name | Default Value | Description |
+|------|---------------|-------------|
+| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
Review Comment:
Yeah, took a second look, you are right, some of them are too long.
But now that we are removing references to the Javadoc, I suggest still adding back more context; I added suggestions to the individual options.
##########
docs/spark-procedures.md:
##########
@@ -283,11 +283,27 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `options` | ️ | map<string, string> | Options to be used for actions|
| `where` | ️ | string | predicate as a string used for filtering the files. Note that all files that may contain data matching the filter will be selected for rewriting|
+#### Options
+
+| Name | Default Value | Description |
+|------|---------------|-------------|
+| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
+| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing |
+| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled |
+| `use-starting-sequence-number` | true | Use the sequence number of the snapshot at compaction start time instead of that of the newly produced snapshot |
+| `rewrite-job-order` | none | Force the rewrite job order based on the value (one of bytes-asc, bytes-desc, files-asc, files-desc, none) |
+| `target-file-size-bytes` | default value of `write.target-file-size-bytes` from [table properties](../configuration/#write-properties) | Target output file size |
+| `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria |
+| `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria |
+| `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria |
+| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
+| `max-file-group-size-bytes` | 107374182400 | Largest amount of data that should be rewritten in a single file group |
+| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `compression-factor` | 1.0 | Let user adjust for file size used for estimating actual output data size (used with sort strategy) |
+| `shuffle-partitions-per-file` | 1 | Number of shuffle partitions to use for each output file (used with sort strategy) |
Review Comment:
I think we should add this sentence back:
```
Iceberg will use a custom coalesce operation to stitch these sorted partitions back together into a single sorted file.
```
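It might also help to pair that sentence with a short sort-strategy example so readers see where `shuffle-partitions-per-file` applies. This is just a sketch; the table name, sort column, and value are placeholders:
```sql
-- Sort-based rewrite; each output file is produced from 2 shuffle partitions,
-- which Iceberg coalesces back together into a single sorted file.
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  strategy => 'sort',
  sort_order => 'id ASC NULLS LAST',
  options => map('shuffle-partitions-per-file', '2')
);
```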
##########
docs/spark-procedures.md:
##########
@@ -283,11 +283,27 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `options` | ️ | map<string, string> | Options to be used for actions|
| `where` | ️ | string | predicate as a string used for filtering the files. Note that all files that may contain data matching the filter will be selected for rewriting|
+#### Options
+
+| Name | Default Value | Description |
+|------|---------------|-------------|
+| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
+| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing |
+| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled |
+| `use-starting-sequence-number` | true | Use the sequence number of the snapshot at compaction start time instead of that of the newly produced snapshot |
+| `rewrite-job-order` | none | Force the rewrite job order based on the value (one of bytes-asc, bytes-desc, files-asc, files-desc, none) |
+| `target-file-size-bytes` | default value of `write.target-file-size-bytes` from [table properties](../configuration/#write-properties) | Target output file size |
+| `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria |
+| `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria |
+| `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria |
+| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
+| `max-file-group-size-bytes` | 107374182400 | Largest amount of data that should be rewritten in a single file group |
+| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `compression-factor` | 1.0 | Let user adjust for file size used for estimating actual output data size (used with sort strategy) |
Review Comment:
Yeah, I think we can add sub-sections like:
- General options
- Options used when `sort_order` is set
- Options used when `strategy` is z_order

Please check and correct.
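For the z-order sub-section, an example call could look roughly like the following. The column names are placeholders; this just sketches the existing `rewrite_data_files` z-order syntax:
```sql
-- Z-order rewrite over two columns (column names are illustrative only).
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  strategy => 'sort',
  sort_order => 'zorder(c1,c2)'
);
```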
##########
docs/spark-procedures.md:
##########
@@ -283,11 +283,27 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `options` | ️ | map<string, string> | Options to be used for actions|
| `where` | ️ | string | predicate as a string used for filtering the files. Note that all files that may contain data matching the filter will be selected for rewriting|
+#### Options
+
+| Name | Default Value | Description |
+|------|---------------|-------------|
+| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
+| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing |
+| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled |
+| `use-starting-sequence-number` | true | Use the sequence number of the snapshot at compaction start time instead of that of the newly produced snapshot |
+| `rewrite-job-order` | none | Force the rewrite job order based on the value (one of bytes-asc, bytes-desc, files-asc, files-desc, none) |
+| `target-file-size-bytes` | default value of `write.target-file-size-bytes` from [table properties](../configuration/#write-properties) | Target output file size |
+| `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria |
+| `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria |
+| `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria |
+| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
+| `max-file-group-size-bytes` | 107374182400 | Largest amount of data that should be rewritten in a single file group |
+| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting |
+| `compression-factor` | 1.0 | Let user adjust for file size used for estimating actual output data size (used with sort strategy) |
Review Comment:
To me, this one needs more context, so I would just put back the original javadoc.
```
The number of shuffle partitions and consequently the number of output files created by the Spark sort is based on the size of the input data files used in this file rewriter. Due to compression, the disk file sizes may not accurately represent the size of files in the output. This parameter lets the user adjust the file size used for estimating actual output data size. A factor greater than 1.0 would generate more files than we would expect based on the on-disk file size. A value less than 1.0 would create fewer files than we would expect based on the on-disk size.
```
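A tiny usage sketch could also help connect that text to the option. The table name, sort column, and factor below are made up for illustration:
```sql
-- Treat the heavily compressed on-disk input as roughly 2x its on-disk size
-- when planning the sort, so more output files are produced than the on-disk
-- size alone would suggest.
CALL catalog_name.system.rewrite_data_files(
  table => 'db.sample',
  strategy => 'sort',
  sort_order => 'id',
  options => map('compression-factor', '2.0')
);
```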
##########
docs/spark-procedures.md:
##########
@@ -283,11 +283,27 @@ Iceberg can compact data files in parallel using Spark with the `rewriteDataFile
| `options` | ️ | map<string, string> | Options to be used for actions|
| `where` | ️ | string | predicate as a string used for filtering the files. Note that all files that may contain data matching the filter will be selected for rewriting|
+#### Options
+
+| Name | Default Value | Description |
+|------|---------------|-------------|
+| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten |
+| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing |
+| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled |
+| `use-starting-sequence-number` | true | Use the sequence number of the snapshot at compaction start time instead of that of the newly produced snapshot |
+| `rewrite-job-order` | none | Force the rewrite job order based on the value (one of bytes-asc, bytes-desc, files-asc, files-desc, none) |
+| `target-file-size-bytes` | default value of `write.target-file-size-bytes` from [table properties](../configuration/#write-properties) | Target output file size |
+| `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria |
+| `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria |
+| `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria |
+| `rewrite-all` | false | Force rewriting of all provided files overriding other options |
+| `max-file-group-size-bytes` | 107374182400 | Largest amount of data that should be rewritten in a single file group |
Review Comment:
Also, maybe we can put back "(100 GB)".