[ https://issues.apache.org/jira/browse/FLINK-19928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268477#comment-17268477 ]
Robert Metzger commented on FLINK-19928:
----------------------------------------
I haven't spent any more time on this since I created the ticket.
However, I just went through all documented config options and propose the
following configuration:
heartbeat.interval: 1000 (default 10000)
metrics.fetcher.update-interval: 1000 (default 10000)
metrics.latency.interval: 1000 (default 0)
metrics.system-resource: true (default false)
metrics.system-resource-probing-interval: 1000 (default 5000)
Randomize these configuration keys:
taskmanager.network.blocking-shuffle.compression.enabled: true
taskmanager.network.blocking-shuffle.type: file / mmap
taskmanager.network.detailed-metrics: true
taskmanager.network.netty.transport: epoll / nio
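
A minimal sketch of how the proposal above could be assembled for tests,
using plain string maps rather than Flink's own configuration classes. The
class name TestConfigSketch, its method names, and the seed handling are
hypothetical and not part of Flink. The sketch also assumes that
"randomize" means choosing between true and false for the boolean keys, and
between the two listed values for the other keys.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical helper, not part of Flink: builds the proposed test settings as key/value pairs. */
final class TestConfigSketch {

    /** Fixed overrides from the proposal: shorter intervals and extra metrics enabled. */
    static Map<String, String> fixedOverrides() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("heartbeat.interval", "1000");
        conf.put("metrics.fetcher.update-interval", "1000");
        conf.put("metrics.latency.interval", "1000");
        conf.put("metrics.system-resource", "true");
        conf.put("metrics.system-resource-probing-interval", "1000");
        return conf;
    }

    /**
     * Randomized overrides. The same seed always yields the same values,
     * so a failing run can be reproduced if the seed is logged.
     * Assumption: boolean keys are drawn from {true, false}.
     */
    static Map<String, String> randomizedOverrides(long seed) {
        Random rnd = new Random(seed);
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("taskmanager.network.blocking-shuffle.compression.enabled",
                String.valueOf(rnd.nextBoolean()));
        conf.put("taskmanager.network.blocking-shuffle.type", pick(rnd, "file", "mmap"));
        conf.put("taskmanager.network.detailed-metrics", String.valueOf(rnd.nextBoolean()));
        conf.put("taskmanager.network.netty.transport", pick(rnd, "epoll", "nio"));
        return conf;
    }

    /** Picks one of the given options uniformly at random. */
    private static String pick(Random rnd, String... options) {
        return options[rnd.nextInt(options.length)];
    }

    public static void main(String[] args) {
        long seed = System.nanoTime();
        Map<String, String> conf = new LinkedHashMap<>(fixedOverrides());
        conf.putAll(randomizedOverrides(seed));
        // Print the seed first so that an unstable run can be replayed.
        System.out.println("test configuration seed: " + seed);
        conf.forEach((key, value) -> System.out.println(key + ": " + value));
    }
}
```

Logging the seed matters here: without it, a failure that only shows up
under one combination of randomized values could not be reproduced.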
WDYT?
> Introduce test configuration to detect instabilities better
> -----------------------------------------------------------
>
> Key: FLINK-19928
> URL: https://issues.apache.org/jira/browse/FLINK-19928
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Tests
> Reporter: Robert Metzger
> Priority: Major
>
> As part of debugging FLINK-19805, I noticed that invalid system states
> sometimes depend on configuration values.
> For example, the "heartbeat.interval" is configured to 10 seconds by default.
> Many tests do not run that long, which makes it difficult to find test
> failures related to the heartbeat.
> Similarly to intervals, retry configurations can also cause failures to be
> hidden.
> It will be difficult to spread this to all tests, but adding it to the
> {{MiniClusterResource}} would be a start.