sjwiesman commented on a change in pull request #13990:
URL: https://github.com/apache/flink/pull/13990#discussion_r520212493



##########
File path: docs/dev/table/connectors/filesystem.md
##########
@@ -150,6 +150,41 @@ become finished on the next checkpoint) control the size and number of these par
 **NOTE:** For row formats (csv, json), you can set the parameter `sink.rolling-policy.file-size` or `sink.rolling-policy.rollover-interval` in the connector properties and parameter `execution.checkpointing.interval` in flink-conf.yaml together
 if you don't want to wait a long period before observe the data exists in file system. For other formats (avro, orc), you can just set parameter `execution.checkpointing.interval` in flink-conf.yaml.
 
+### File Compaction
+
+If you want a smaller checkpoint interval and do not want to generate a large number of small files,
+it is recommended that you open file compaction:
+
+<table class="table table-bordered">
+  <thead>
+    <tr>
+        <th class="text-left" style="width: 20%">Key</th>
+        <th class="text-left" style="width: 15%">Default</th>
+        <th class="text-left" style="width: 10%">Type</th>
+        <th class="text-left" style="width: 55%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+        <td><h5>auto-compaction</h5></td>
+        <td style="word-wrap: break-word;">false</td>
+        <td>Boolean</td>
+        <td>Whether to enable automatic compaction in streaming sink or not. The data will be written to temporary files. After the checkpoint is completed, the temporary files generated by a checkpoint will be compacted. The temporary files are invisible before compaction.</td>
+    </tr>
+    <tr>
+        <td><h5>compaction.file-size</h5></td>
+        <td style="word-wrap: break-word;">(none)</td>
+        <td>MemorySize</td>
+        <td>The compaction target file size, the default value is the rolling file size.</td>
+    </tr>
+  </tbody>
+</table>
+
+After you open file compaction, small files that are not large enough will be merged into large files,
+It is worth noting that:
+- Only files in a single checkpoint are compacted, that is, at least the same number of files as the number of checkpoints is generated.
+- The file before merging is invisible, so the visibility of the file may be: checkpoint interval + compaction time.

Review comment:
       ```suggestion
   If enabled, file compaction will merge multiple small files into larger files based on the target file size.
   When running file compaction in production, please be aware that:
   - Only files in a single checkpoint are compacted, that is, at least the same number of files as the number of checkpoints is generated.
   - The file before merging is invisible, so the visibility of the file may be: checkpoint interval + compaction time.
   ```
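
For readers following this thread, here is a minimal DDL sketch of how the two compaction options discussed above would appear on a filesystem sink. The table name, schema, path, and sizes are hypothetical illustrations, not part of the PR:

```sql
CREATE TABLE fs_sink (                         -- hypothetical sink table
  user_id STRING,
  order_amount DOUBLE,
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/output',               -- hypothetical output path
  'format' = 'json',                           -- a row format, as in the note above
  'auto-compaction' = 'true',                  -- compact the files produced by each checkpoint
  'compaction.file-size' = '128MB'             -- target size; defaults to the rolling file size
);
```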

##########
File path: docs/dev/table/connectors/filesystem.md
##########
@@ -150,6 +150,41 @@ become finished on the next checkpoint) control the size and number of these par
 **NOTE:** For row formats (csv, json), you can set the parameter `sink.rolling-policy.file-size` or `sink.rolling-policy.rollover-interval` in the connector properties and parameter `execution.checkpointing.interval` in flink-conf.yaml together
 if you don't want to wait a long period before observe the data exists in file system. For other formats (avro, orc), you can just set parameter `execution.checkpointing.interval` in flink-conf.yaml.
 
+### File Compaction
+
+If you want a smaller checkpoint interval and do not want to generate a large number of small files,
+it is recommended that you open file compaction:

Review comment:
       ```suggestion
   The file sink supports file compaction, which allows applications to have smaller checkpoint intervals without generating a large number of files.
   ```
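
As a companion sketch for the rolling-policy note quoted in this hunk: the rolling options live in the connector properties, while the checkpoint interval is set separately in flink-conf.yaml (for example `execution.checkpointing.interval: 1 min`). Table name, path, and values below are illustrative only:

```sql
CREATE TABLE csv_sink (                                  -- hypothetical row-format sink
  log_line STRING
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/csv-out',                        -- hypothetical output path
  'format' = 'csv',
  'sink.rolling-policy.file-size' = '128MB',             -- roll a new part file at this size
  'sink.rolling-policy.rollover-interval' = '15 min'     -- or after this much time has passed
);
```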




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

