[
https://issues.apache.org/jira/browse/FLINK-20122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237917#comment-17237917
]
Yun Gao edited comment on FLINK-20122 at 11/24/20, 7:21 AM:
------------------------------------------------------------
I tested the feature with a writer task that writes and compacts data to the file
sink, and a reader task that checks the result is as expected:
Normal cases on a standalone cluster (4 TMs, each with 2 slots), writing to HDFS:
# Tested with parameters (checkpoint interval = 10s, record length = 20k, rate
= 1000 records/sec, total record count = 10000, format = csv) and the default
parameters for FileSystemTableSink (by default the compaction file size
= 128m); the files are indeed compacted into files no larger than 128m, and the
result is right.
# Further configured sink.rolling-policy.file-size = 1mb and
compaction.file-size = 5242880 (5mb); the files are merged into files no larger
than 5mb, and the result is right.
# Changed compaction.file-size to 10m; the files are now merged into files no
larger than 10mb, and the result is right.
# Made subtasks 0 and 1 output more data than the others and changed
sink.rolling-policy.file-size to 128mb; the small files are merged, while the
large files generated by subtasks 0 and 1 are not merged because they exceed
the threshold. The result is right.
# Repeated 1 - 4 for the other formats (json, avro, parquet, orc); all the
results are right. (The json issue happens on the reader task side.)
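For reference, the option combination exercised in case 2 can be expressed as a
filesystem sink DDL. The table schema and path below are made up for
illustration; only the WITH options mirror the tested configuration:

```sql
-- Hypothetical table; only the connector options reflect the tested setup.
CREATE TABLE fs_sink (
  id BIGINT,
  payload STRING
) WITH (
  'connector' = 'filesystem',
  'path' = 'hdfs:///tmp/fs_sink',        -- illustrative path
  'format' = 'csv',
  'auto-compaction' = 'true',            -- compaction is opt-in
  'sink.rolling-policy.file-size' = '1mb', -- roll small files ...
  'compaction.file-size' = '5mb'           -- ... then merge up to ~5mb
);
```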
Normal cases on a standalone cluster (4 TMs, each with 2 slots), writing to S3:
# Tested the first HDFS case against S3. The result is still right; however,
since compaction is slower on S3 (due to network connections and copying),
backpressure occurs and checkpoints are also delayed. [~lzljs3620320]
[~Leonard Xu] Do you think we should also warn users about this backpressure in
the documentation?
Abnormal cases:
# Tested with compaction.file-size = -1; it throws an exception because -1 is
not a valid number value.
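As a side note, the "no file larger than the threshold" checks in cases 2 and 3
can be automated with a small directory scan once the sink output is pulled to
a local path. This is a hypothetical helper sketch, not part of the tested job,
and it deliberately ignores the case-4 situation where single files may
legitimately exceed the threshold:

```python
# Hypothetical sanity check for the compaction result: verify that no
# finished file under the sink directory exceeds the configured
# compaction.file-size threshold (e.g. 5 MB in case 2).
import os


def files_within_limit(sink_dir: str, limit_bytes: int) -> bool:
    """Return True if no file under sink_dir is larger than limit_bytes.

    Note: files already exceeding the rolling threshold (case 4) are
    allowed to stay un-merged, so a real check would filter those out.
    """
    for root, _dirs, names in os.walk(sink_dir):
        for name in names:
            if os.path.getsize(os.path.join(root, name)) > limit_bytes:
                return False
    return True
```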
> Test Small file compaction
> --------------------------
>
> Key: FLINK-20122
> URL: https://issues.apache.org/jira/browse/FLINK-20122
> Project: Flink
> Issue Type: Sub-task
> Components: Connectors / FileSystem
> Affects Versions: 1.12.0
> Reporter: Robert Metzger
> Assignee: Yun Gao
> Priority: Critical
> Fix For: 1.12.0
>
>
> ----
> [General Information about the Flink 1.12 release
> testing|https://cwiki.apache.org/confluence/display/FLINK/1.12+Release+-+Community+Testing]
> When testing a feature, consider the following aspects:
> - Is the documentation easy to understand
> - Are the error messages, log messages, APIs etc. easy to understand
> - Is the feature working as expected under normal conditions
> - Is the feature working / failing as expected with invalid input, induced
> errors etc.
> If you find a problem during testing, please file a ticket
> (Priority=Critical; Fix Version = 1.12.0), and link it in this testing ticket.
> During the testing keep us updated on tests conducted, or please write a
> short summary of all things you have tested in the end.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)