[
https://issues.apache.org/jira/browse/FLINK-20122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237917#comment-17237917
]
Yun Gao edited comment on FLINK-20122 at 11/24/20, 7:21 AM:
------------------------------------------------------------
I tested the feature with a writer task that writes and compacts data to the file
sink, and a reader task that checks the result is as expected:
Normal cases on a standalone cluster (4 TMs, each with 2 slots), writing to HDFS:
# Tested with parameters (checkpoint interval = 10s, record length = 20k, rate
= 1000 records/sec, total record count = 10000, format = csv) and the default
parameters for FileSystemTableSink (by default the compaction file size
= 128m); the files are indeed compacted into files no larger than 128m, and the
result is right.
# Further configured sink.rolling-policy.file-size = 1mb and
compaction.file-size = 5242880 (5mb); the files are merged into files no larger
than 5mb, and the result is right.
# Changed compaction.file-size to 10m; the files are now merged into files no
larger than 10mb, and the result is right.
# Made subtasks 0 and 1 output more data than the others and changed
sink.rolling-policy.file-size to 128mb; the small files are merged, while the
large files generated by subtasks 0 and 1 are not merged because they exceed
the threshold. The result is right.
# Repeated 1 - 4 for the other formats (json, avro, parquet, orc); all the
results are right. (The json issue happens on the reader task side.)
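For reference, the option combination exercised in case 2 can be expressed as a
filesystem sink DDL. The table schema and path below are made up for
illustration; only the WITH options mirror the tested configuration:

```sql
-- Hypothetical table; only the connector options reflect the tested setup.
CREATE TABLE fs_sink (
  id BIGINT,
  payload STRING
) WITH (
  'connector' = 'filesystem',
  'path' = 'hdfs:///tmp/fs_sink',        -- illustrative path
  'format' = 'csv',
  'auto-compaction' = 'true',            -- compaction is opt-in
  'sink.rolling-policy.file-size' = '1mb', -- roll small files ...
  'compaction.file-size' = '5mb'           -- ... then merge up to ~5mb
);
```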
Normal cases on a standalone cluster (4 TMs, each with 2 slots), writing to S3:
# Tested the first HDFS case against S3. The result is still right; however,
since compaction is slower on S3 (due to network connections and copying),
backpressure occurs and checkpoints are also delayed. [~lzljs3620320]
[~Leonard Xu] Do you think we should also warn users about this backpressure in
the documentation?
Abnormal cases:
# Tested with compaction.file-size = -1; it throws an exception because -1 is
not a valid number value.
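As a side note, the "no file larger than the threshold" checks in cases 2 and 3
can be automated with a small directory scan once the sink output is pulled to
a local path. This is a hypothetical helper sketch, not part of the tested job,
and it deliberately ignores the case-4 situation where single files may
legitimately exceed the threshold:

```python
# Hypothetical sanity check for the compaction result: verify that no
# finished file under the sink directory exceeds the configured
# compaction.file-size threshold (e.g. 5 MB in case 2).
import os


def files_within_limit(sink_dir: str, limit_bytes: int) -> bool:
    """Return True if no file under sink_dir is larger than limit_bytes.

    Note: files already exceeding the rolling threshold (case 4) are
    allowed to stay un-merged, so a real check would filter those out.
    """
    for root, _dirs, names in os.walk(sink_dir):
        for name in names:
            if os.path.getsize(os.path.join(root, name)) > limit_bytes:
                return False
    return True
```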
> Test Small file compaction
> --------------------------
>
> Key: FLINK-20122
> URL: https://issues.apache.org/jira/browse/FLINK-20122
> Project: Flink
> Issue Type: Sub-task
> Components: Connectors / FileSystem
> Affects Versions: 1.12.0
> Reporter: Robert Metzger
> Assignee: Yun Gao
> Priority: Critical
> Fix For: 1.12.0
>
>
> ----
> [General Information about the Flink 1.12 release
> testing|https://cwiki.apache.org/confluence/display/FLINK/1.12+Release+-+Community+Testing]
> When testing a feature, consider the following aspects:
> - Is the documentation easy to understand
> - Are the error messages, log messages, APIs etc. easy to understand
> - Is the feature working as expected under normal conditions
> - Is the feature working / failing as expected with invalid input, induced
> errors etc.
> If you find a problem during testing, please file a ticket
> (Priority=Critical; Fix Version = 1.12.0), and link it in this testing ticket.
> During the testing keep us updated on tests conducted, or please write a
> short summary of all things you have tested in the end.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)