[ 
https://issues.apache.org/jira/browse/FLINK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Khachatryan updated FLINK-20103:
--------------------------------------
    Description: 
This is a follow-up ticket after FLINK-20097.

With the current setup (UnalignedITCase):
 - race conditions are not detected reliably (about once per tens of runs)
 - some bugs show up only after changing the configuration (low checkpoint timeout)
 - adding a new job graph often reveals a new bug

An additional issue with the current setup is that it's difficult to git bisect 
(for long ranges). 

Changes that might hide the bugs:
 - having Preconditions in ChannelStatePersister (slow down processing)
 - some Preconditions may mask errors by causing job restart
 - timings in tests (UnalignedITCase)

Some options to consider:
 # chaos monkey tests including induced latency and/or CPU bursts - on 
different workloads/configs
 # side-by-side tests with randomized inputs/configs
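
As a rough illustration of the second option, a side-by-side test could draw its inputs and configuration from a seeded random generator, run a reference path and the path under test on the same data, and report the seed on any mismatch so that the failure can be replayed (which may also help with git bisect). The sketch below is plain Java and does not use any Flink API; the class name, the method names and the "buffer size" knob are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SideBySideSketch {

    /** Generates random input records (hypothetical stand-in for a test source). */
    public static List<Integer> randomInputs(Random rnd, int count) {
        List<Integer> inputs = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            inputs.add(rnd.nextInt());
        }
        return inputs;
    }

    /** Reference path: processes records one at a time. */
    public static long reference(List<Integer> inputs) {
        long sum = 0;
        for (int v : inputs) {
            sum += v;
        }
        return sum;
    }

    /** Path under test: processes the same records in buffers of bufferSize. */
    public static long buffered(List<Integer> inputs, int bufferSize) {
        long sum = 0;
        for (int start = 0; start < inputs.size(); start += bufferSize) {
            int end = Math.min(start + bufferSize, inputs.size());
            for (int v : inputs.subList(start, end)) {
                sum += v;
            }
        }
        return sum;
    }

    /** Runs one trial derived entirely from the seed; true when both paths agree. */
    public static boolean runTrial(long seed) {
        Random rnd = new Random(seed);
        int bufferSize = 1 + rnd.nextInt(64);   // randomized config knob (assumed)
        int inputCount = rnd.nextInt(1000);     // randomized input size (assumed)
        List<Integer> inputs = randomInputs(rnd, inputCount);
        return reference(inputs) == buffered(inputs, bufferSize);
    }

    public static void main(String[] args) {
        // Pass a seed as the first argument to replay a failing run.
        long baseSeed = args.length > 0 ? Long.parseLong(args[0]) : System.nanoTime();
        for (int i = 0; i < 100; i++) {
            long seed = baseSeed + i;
            if (!runTrial(seed)) {
                throw new AssertionError("Outputs differ; reproduce with seed " + seed);
            }
        }
        System.out.println("100 trials passed, base seed " + baseSeed);
    }
}
```

In a real test the two paths would presumably be two job configurations (for example aligned vs. unaligned checkpoints) rather than two loops; the part worth keeping is that every random choice derives from one logged seed.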

Extending Jepsen coverage further (validating output) does not seem promising 
in the context of Flink because its output isn't linearisable.
 

  was:
This is a follow-up ticket after FLINK-20097.

The current setup (UnalignedITCase) doesn't reveal:
 - race conditions (1 per tens of runs)
 - or bugs triggered in some specific setup (low checkpoint timeout)

An additional issue with the current setup is that it's difficult to git bisect 
(for long ranges).

 

Changes that might hide the bugs:
 - having Preconditions in ChannelStatePersister (slow down processing)
 - some Preconditions may mask errors by causing job restart
 - timings in tests (UnalignedITCase)

 

Some options to consider
 # chaos monkey tests including induced latency and/or CPU bursts - on 
different workloads/configs

 # side-by-side tests with randomized inputs/configs

Extending Jepsen coverage further (validating output) does not seem promising 
in the context of Flink because its output isn't linearisable.
 


> Improve test coverage for network stack
> ---------------------------------------
>
>                 Key: FLINK-20103
>                 URL: https://issues.apache.org/jira/browse/FLINK-20103
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Network, Tests
>            Reporter: Roman Khachatryan
>            Assignee: Roman Khachatryan
>            Priority: Major
>             Fix For: 1.13.0
>
>
> This is a follow-up ticket after FLINK-20097.
> With the current setup (UnalignedITCase):
>  - race conditions are not detected reliably (about once per tens of runs)
>  - some bugs show up only after changing the configuration (low checkpoint timeout)
>  - adding a new job graph often reveals a new bug
> An additional issue with the current setup is that it's difficult to git 
> bisect (for long ranges). 
> Changes that might hide the bugs:
>  - having Preconditions in ChannelStatePersister (slow down processing)
>  - some Preconditions may mask errors by causing job restart
>  - timings in tests (UnalignedITCase)
> Some options to consider:
>  # chaos monkey tests including induced latency and/or CPU bursts - on 
> different workloads/configs
>  # side-by-side tests with randomized inputs/configs
> Extending Jepsen coverage further (validating output) does not seem promising 
> in the context of Flink because its output isn't linearisable.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)