Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

Etienne Chauchot Wed, 07 Jul 2021 03:41:06 -0700

Hi,

Thanks Xintong for the very good doc ! I added 2 comments to it.


Best,

Etienne

On 02/07/2021 05:57, Xintong Song wrote:

Thanks all for the positive feedback.

I have updated the wiki page [1], and will send an announcement in a
separate thread, to draw more committers' attention.

Moreover, I've opened FLINK-23212 where we can continue with the discussion
around pure documentation changing PRs.

Thank you~

Xintong Song


[1] https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests

On Wed, Jun 30, 2021 at 5:26 PM Xintong Song <tonysong...@gmail.com> wrote:

I second Tison's opinion.

Similar to how we skip docs_404_check for PRs that do not touch the
documentation, we can skip other stages for PRs that only contain
documentation changes.

In addition to making merging documentation PRs easier, we can also reduce
the workload on CI workers. Especially during the last days of a release
cycle, which is usually the most busy time for the CI workers, and is also
where most documentation efforts take place.

Thank you~

Xintong Song



On Wed, Jun 30, 2021 at 3:56 PM Till Rohrmann <trohrm...@apache.org>
wrote:

I think you are right Tison that docs are a special case and they only
require flink-docs to pass. What I am wondering is how much of a problem
this will be (assuming that we have a decent build stability). The more
exceptions we add, the harder it will be to properly follow the
guidelines.
Maybe we can observe how many docs PRs get delayed/not merged because of
this and then revisit this discussion if needed.

Cheers,
Till

On Wed, Jun 30, 2021 at 8:30 AM tison <wander4...@gmail.com> wrote:

Hi,

There are a number of PRs modifying docs only, but we still require all
tests passed on that.

It is a good proposal we avoid merge PR with "unrelated" failure, but

can

we improve the case where the contributor only works for docs?

For example, base on the file change set, run doc tests only.

Best,
tison.


godfrey he <godfre...@gmail.com> 于2021年6月30日周三 下午2:17写道：

+1 for the proposal. Thanks Xintong!

Best,
Godfrey



Rui Li <lirui.fu...@gmail.com> 于2021年6月30日周三 上午11:36写道：

Thanks Xintong. +1 to the proposal.

On Tue, Jun 29, 2021 at 11:05 AM 刘建刚 <liujiangangp...@gmail.com>

wrote:

+1 for the proposal. Since the test time is long and environment

may

vary,

unstable tests are really annoying for developers. The solution is

welcome.

Best
liujiangang

Jingsong Li <jingsongl...@gmail.com> 于2021年6月29日周二 上午10:31写道：

+1 Thanks Xintong for the update!

Best,
Jingsong

On Mon, Jun 28, 2021 at 6:44 PM Till Rohrmann <

trohrm...@apache.org>

wrote:

+1, thanks for updating the guidelines Xintong!

Cheers,
Till

On Mon, Jun 28, 2021 at 11:49 AM Yangze Guo <

karma...@gmail.com>

wrote:

+1

Thanks Xintong for drafting this doc.

Best,
Yangze Guo

On Mon, Jun 28, 2021 at 5:42 PM JING ZHANG <

beyond1...@gmail.com

wrote:

Thanks Xintong for giving detailed documentation.

The best practice for handling test failure is very

detailed,

it's

good

guidelines document with clear action steps.

+1 to Xintong's proposal.

Xintong Song <tonysong...@gmail.com> 于2021年6月28日周一

下午4:07写道：

Thanks all for the discussion.

Based on the opinions so far, I've drafted the new

guidelines

[1],

as a

potential replacement of the original wiki page [2].

Hopefully this draft has covered the most opinions

discussed

and

consensus

made in this discussion thread.

Looking forward to your feedback.

Thank you~

Xintong Song


[1]

https://docs.google.com/document/d/1uUbxbgbGErBXtmEjhwVhBWG3i6nhQ0LXs96OlntEYnU/edit?usp=sharing

[2]

https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests



On Fri, Jun 25, 2021 at 10:40 PM Piotr Nowojski <

pnowoj...@apache.org>

wrote:

Thanks for the clarification Till. +1 for what you

have

written.

Piotrek

pt., 25 cze 2021 o 16:00 Till Rohrmann <

trohrm...@apache.org

napisał(a):

One quick note for clarification. I don't have

anything

against

builds

running on your personal Azure account and this is

not

what I

understood

under "local environment". For me "local

environment"

means

that

someone

runs the test locally on his machine and then says

that

the

tests have passed locally.

I do agree that there might be a conflict of

interests

if a

PR

author

disables tests. Here I would argue that we don't

have

malignant

committers

which means that every committer will probably first

check

the

respective

ticket for how often the test failed. Then I guess

the

next

step

would

be

to discuss on the ticket whether to disable it or

not.

And

finally,

after

reaching a consensus, it will be disabled. If we see

someone

abusing

this

policy, then we can still think about how to guard

against

it.

But,

honestly, I have very rarely seen such a case. I am

also

ok

to

pull in

the

release manager to make the final call if this

resolves

concerns.

Cheers,
Till

On Fri, Jun 25, 2021 at 9:07 AM Piotr Nowojski <

pnowoj...@apache.org>

wrote:

+1 for the general idea, however I have concerns

about

couple

of

details.

I would first try to not introduce the exception

for

local

builds.

It makes it quite hard for others to verify the

build

and

to

make

sure

that the right things were executed.

I would counter Till's proposal to ignore local

green

builds.

If

committer

is merging and closing a PR, with official azure

failure,

but

there

was a

green build before or in local azure it's IMO

enough

to

leave

the

message:

Latest build failure is a known issue:

FLINK-12345

Green local build: URL

This should address Till's concern about

verification.

On the other hand I have concerns about disabling

tests.*

It

shouldn't

be

the PR author/committer that's disabling a test on

his

own,

as

that's a

conflict of interests*. I have however no problems

with

disabling

test

instabilities that were marked as "blockers"

though,

that

should

work

pretty well. But the important thing here is to

correctly

judge

bumping

priorities of test instabilities based on their

frequency

and

current

general health of the system. I believe that

release

managers

should

be

playing a big role here in deciding on the

guidelines

of

what

should

be a

priority of certain test instabilities.

What I mean by that is two example scenarios:
1. if we have a handful of very frequently failing

tests

and

handful

of

very rarely failing tests (like one reported

failure

and

no

another

occurrence in many months, and let's even say that

the

failure

looks

like

infrastructure/network timeout), we should focus

on

the

frequently

failing

ones, and probably we are safe to ignore for the

time

being

the

rare

issues

- at least until we deal with the most pressing

ones.

2. If we have tons of rarely failing test

instabilities,

we

should

probably

start addressing them as blockers.

I'm using my own conscious and my best judgement

when

I'm

bumping/decreasing priorities of test

instabilities

(and

bugs),

but

as

individual committer I don't have the full

picture.

As

wrote

above, I

think release managers are in a much better

position

to

keep

adjusting

those kind of guidelines.

Best, Piotrek

pt., 25 cze 2021 o 08:10 Yu Li <car...@gmail.com>

napisał(a):

+1 for Xintong's proposal.

For me, resolving problems directly (fixing the

infrastructure

issue,

disabling unstable tests and creating blocker

JIRAs

to

track

the

fix

and

re-enable them asap, etc.) is (in most cases)

better

than

working

around

them (verify locally, manually check and judge

the

failure

as

"unrelated",

etc.), and I believe the proposal could help us

pushing

those

more

"real"

solutions forward.

Best Regards,
Yu


On Fri, 25 Jun 2021 at 10:58, Yangze Guo <

karma...@gmail.com

wrote:

Creating a blocker issue for the manually

disabled

tests

sounds

good

to

me.

Minor: I'm still a bit worried about the

commits

merged

before we

fix

the unstable tests can also break those tests.

Instead

of

letting

the

assigners keep a look at all potentially

related

commits,

they

can

maintain a branch that is periodically synced

with

the

master

branch

while enabling the unstable test. So that they

can

catch

the

breaking

changes asap.

Best,
Yangze Guo

On Thu, Jun 24, 2021 at 9:52 PM Till Rohrmann

trohrm...@apache.org>

wrote:

I like the idea of creating a blocker issue

for a

disabled

test.

This

will

force us to resolve it in a timely manner

and

it

won't

fall

through

the

cracks.

Cheers,
Till

On Thu, Jun 24, 2021 at 8:06 AM Jingsong Li

jingsongl...@gmail.com>

wrote:

+1 to Xintong's proposal

I also have some concerns about unstable

cases.

I think unstable cases can be divided into

these

types:

- Force majeure: For example, network

timeout,

sudden

environmental

collapse, they are accidental and can

always

be

solved

by

triggering

azure

again. Committers should wait for the next

green

azure.

- Obvious mistakes: For example, some

errors

caused

by

obvious

reasons

may

be repaired quickly. At this time, do we

need

to

wait,

or not

wait

and

just

ignore?

- Difficult questions: These problems are

very

difficult

to

find.

There

will be no solution for a while and a

half.

We

don't

even

know

the

reason.

At this time, we should ignore it. (Maybe

it's

judged

by

the

author

of

the

case. But what about the old case whose

author

can't

be

found?)

So, the ignored cases should be the block

of

the

next

release

until

the

reason is found or the case is fixed?  We

need

to

ensure

that

someone

will

take care of these cases, because there

is no

deepening

of

failed

tests, no

one may continue to pay attention to these

cases.

I think this guideline should consider

these

situations,

and

show

how

to

solve them.

Best,
Jingsong

On Thu, Jun 24, 2021 at 10:57 AM Jark Wu <

imj...@gmail.com>

wrote:

Thanks to Xintong for bringing up this

topic,

I'm

+1

in

general.

However, I think it's still not very

clear

how

we

address

the

unstable

tests.
I think this is a very important part of

this

new

guideline.

According to the discussion above, if

some

tests

are

unstable,

we

can

manually disable it.
But I have some questions in my mind:
1) Is the instability judged by the

committer

themselves or

by

some

metrics?
2) Should we log the disable commit in

the

corresponding

issue

and

increase

the priority?
3) What if nobody looks into this issue

and

this

becomes

some

potential

bugs released with the new version?
4) If no person is actively working on

the

issue,

who

should

re-enable

it?

Would it block PRs again?


Best,
Jark


On Thu, 24 Jun 2021 at 10:04, Xintong

Song

tonysong...@gmail.com>

wrote:

Thanks all for the feedback.

@Till @Yangze

I'm also not convinced by the idea of

having

an

exception

for

local

builds.

We need to execute the entire build

(or

at

least

the

failing

stage)

locally, to make sure subsequent test

cases

prevented by

the

failure

one

are all executed. In that case, it's

probably

easier

to

rerun

the

build

on

azure than locally.

Concerning disabling unstable test

cases

that

regularly

block

PRs

from

merging, maybe we can say that such

cases

can

only

be

disabled

when

someone

is actively looking into it, likely

the

person

who

disabled

the

case.

If

this person is no longer actively

working

on

it,

he/she

should

enable

the

case again no matter if it is fixed or

not.

@Jing

Thanks for the suggestions.

+1 to provide guidelines on handling

test

failures.

1. Report the test failures in the

JIRA.

+1 on this. Currently, the release

managers

are

monitoring

the

ci

and

cron

build instabilities and reporting

them on

JIRA.

We

should

also

encourage

other contributors to do that for PRs.

2. Set a deadline to find out the root

cause

and

solve

the

failure

for

the

new created JIRA  because we could

not

block

other

commit

merges

for

long

time

3. What to do if the JIRA has not made

significant

progress

when

reached

to

the deadline time?


I'm not sure about these two. It

feels a

bit

against

the

voluntary

nature

of open source projects.

IMHO, frequent instabilities are more

likely

to

be

upgraded

to

the

critical

/ blocker priority, receive more

attention

and

eventually

get

fixed.

Release managers are also responsible

for

looking

for

assignees

for

such

issues. If a case is still not fixed

soonish,

even

with

all

these

efforts,

I'm not sure how setting a deadline

can

help

this.

4. If we disable the respective tests

temporarily,

we

also

need a

mechanism

to ensure the issue would be

continued

to

be

investigated

in

the

future.

+1. As mentioned above, we may

consider

disabling

such

tests

iff

someone

is

actively working on it.

Thank you~

Xintong Song



On Wed, Jun 23, 2021 at 9:56 PM JING

ZHANG

beyond1...@gmail.com

wrote:

Hi Xintong,
+1 to the proposal.
In order to better comply with the

rule,

it

is

necessary

to

describe

what's

best practice if encountering test

failure

which

seems

unrelated

with

the

current commits.
How to avoid merging PR with test

failures

and

not

blocking

code

merging

for a long time?
I tried to think about the possible

steps,

and

found

there

are

some

detailed problems that need to be

discussed

in

step

further:

1. Report the test failures in the

JIRA.

2. Set a deadline to find out the

root

cause

and

solve

the

failure

for

the

new created JIRA  because we could

not

block

other

commit

merges

for

long

time
     When is a reasonable deadline

here?

3. What to do if the JIRA has not

made

significant

progress

when

reached

to

the deadline time?
     There are several situations as

follows,

maybe

different

cases

need

different approaches.
     1. the JIRA is non-assigned yet
     2. not found the root cause yet
     3. not found a good solution,

but

already

found the

root

cause

     4. found a solution, but it

needs

more

time

to

be

done.

4. If we disable the respective

tests

temporarily,

we

also

need a

mechanism

to ensure the issue would be

continued

to

be

investigated

in

the

future.

Best regards,
JING ZHANG

Stephan Ewen <se...@apache.org>

于2021年6月23日周三

下午8:16写道：

+1 to Xintong's proposal

On Wed, Jun 23, 2021 at 1:53 PM

Till

Rohrmann <

trohrm...@apache.org>

wrote:

I would first try to not

introduce

the

exception

for

local

builds.

It

makes

it quite hard for others to

verify

the

build

and to

make

sure

that

the

right things were executed. If

we

see

that

this

becomes

an

issue

then

we

can revisit this idea.

Cheers,
Till

On Wed, Jun 23, 2021 at 4:19 AM

Yangze

Guo

karma...@gmail.com>

wrote:

+1 for appending this to

community

guidelines for

merging

PRs.

@Till Rohrmann
I agree that with this

approach

unstable

tests

will

not

block

other

commit merges. However, it

might

be

hard

to

prevent

merging

commits

that are related to those

tests

and

should

have

been

passed

them.

It's

true that this judgment can be

made

by

the

committers,

but

no

one

can

ensure the judgment is always

precise

and

so

that

we

have

this

discussion thread.

Regarding the unstable tests,

how

about

adding

another

exception:

committers verify it in their

local

environment

and

comment in

such

cases?

Best,
Yangze Guo

On Tue, Jun 22, 2021 at 8:23

PM

刘建刚 <

liujiangangp...@gmail.com

wrote:

It is a good principle to

run

all

tests

successfully

with any

change.

This

means a lot for project's

stability

and

development.

am big

+1

for

this

proposal.

Best
liujiangang

Till Rohrmann <

trohrm...@apache.org>

于2021年6月22日周二

下午6:36写道：

One way to address the

problem

of

regularly

failing

tests

that

block

merging of PRs is to

disable

the

respective

tests

for

the

time

being.

Of

course, the failing test

then

needs

to

be

fixed.

But

at

least

that

way

we

would not block everyone

from

making

progress.

Cheers,
Till

On Tue, Jun 22, 2021 at

12:00

PM

Arvid

Heise

ar...@apache.org

wrote:

I think this is overall

good

idea.

So +1

from

my

side.

However, I'd like to

put a

higher

priority

on

infrastructure

then,

in

particular docker

image/artifact

caches.

On Tue, Jun 22, 2021 at

11:50

AM

Till

Rohrmann

trohrm...@apache.org

wrote:

Thanks for bringing

this

topic

to

our

attention

Xintong.

think

your

proposal makes a lot

of

sense

and

we

should

follow

it.

It

will

give us

confidence that our

changes

are

working

and

it

might

be a

good

incentive

to

quickly fix build

instabilities.

Hence,

+1.

Cheers,
Till

On Tue, Jun 22, 2021

at

11:12

AM

Xintong

Song <

tonysong...@gmail.com>

wrote:

Hi everyone,

In the past a

couple of

weeks,

I've

observed

several

times

that

PRs

are

merged without a

green

light

from

the

CI

tests,

where

failure

cases

are

considered

*unrelated*.

This

may

not

always

cause

problems,

but

would

increase the chance

of

breaking

our

code

base.

In

fact,

it

has

occurred

to

me twice in the past

few

weeks

that I

had

to

revert a

commit

which

breaks

the master branch

due

to

this.

I think it would be

nicer

to

enforce a

stricter

rule,

that

no

PRs

should

be

merged without

passing

CI.

The problems of

merging

PRs

with

"unrelated"

test

failures

are:

- It's not always

straightforward

to

tell

whether a

test

failures are

related or not.
- It prevents

subsequent

test

cases

from

being

executed,

which

may

fail

relating to the PR

changes.

To make things

easier

for

the

committers,

the

following

exceptions

might

be

considered

acceptable.

- The PR has passed

CI

in

the

contributor's

personal

workspace.

Please

post

the link in such

cases.

- The CI tests have

been

triggered

multiple

times, on

the

same

commit,

and

each stage has at

least

passed

for

once.

Please

also

comment

in

such

cases.

If we all agree on

this,

I'd

update the

community

guidelines

for

merging

PRs wrt. this

proposal.

[1]

Please let me know

what

do

you

think.

Thank you~

Xintong Song


[1]

https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests


--
Best, Jingsong Lee


--
Best, Jingsong Lee


--
Best regards!
Rui Li

Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

Reply via email to