Hi,

Thanks Xintong for the very good doc ! I added 2 comments to it.

Best,

Etienne

On 02/07/2021 05:57, Xintong Song wrote:
Thanks all for the positive feedback.

I have updated the wiki page [1], and will send an announcement in a
separate thread, to draw more committers' attention.

Moreover, I've opened FLINK-23212 where we can continue with the discussion
around pure documentation changing PRs.

Thank you~

Xintong Song


[1] https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests

On Wed, Jun 30, 2021 at 5:26 PM Xintong Song <tonysong...@gmail.com> wrote:

I second Tison's opinion.

Similar to how we skip docs_404_check for PRs that do not touch the
documentation, we can skip other stages for PRs that only contain
documentation changes.

In addition to making merging documentation PRs easier, we can also reduce
the workload on CI workers. Especially during the last days of a release
cycle, which is usually the most busy time for the CI workers, and is also
where most documentation efforts take place.

Thank you~

Xintong Song



On Wed, Jun 30, 2021 at 3:56 PM Till Rohrmann <trohrm...@apache.org>
wrote:

I think you are right Tison that docs are a special case and they only
require flink-docs to pass. What I am wondering is how much of a problem
this will be (assuming that we have a decent build stability). The more
exceptions we add, the harder it will be to properly follow the
guidelines.
Maybe we can observe how many docs PRs get delayed/not merged because of
this and then revisit this discussion if needed.

Cheers,
Till

On Wed, Jun 30, 2021 at 8:30 AM tison <wander4...@gmail.com> wrote:

Hi,

There are a number of PRs modifying docs only, but we still require all
tests passed on that.

It is a good proposal we avoid merge PR with "unrelated" failure, but
can
we improve the case where the contributor only works for docs?

For example, base on the file change set, run doc tests only.

Best,
tison.


godfrey he <godfre...@gmail.com> 于2021年6月30日周三 下午2:17写道:

+1 for the proposal. Thanks Xintong!

Best,
Godfrey



Rui Li <lirui.fu...@gmail.com> 于2021年6月30日周三 上午11:36写道:

Thanks Xintong. +1 to the proposal.

On Tue, Jun 29, 2021 at 11:05 AM 刘建刚 <liujiangangp...@gmail.com>
wrote:
+1 for the proposal. Since the test time is long and environment
may
vary,
unstable tests are really annoying for developers. The solution is
welcome.
Best
liujiangang

Jingsong Li <jingsongl...@gmail.com> 于2021年6月29日周二 上午10:31写道:

+1 Thanks Xintong for the update!

Best,
Jingsong

On Mon, Jun 28, 2021 at 6:44 PM Till Rohrmann <
trohrm...@apache.org>
wrote:

+1, thanks for updating the guidelines Xintong!

Cheers,
Till

On Mon, Jun 28, 2021 at 11:49 AM Yangze Guo <
karma...@gmail.com>
wrote:
+1

Thanks Xintong for drafting this doc.

Best,
Yangze Guo

On Mon, Jun 28, 2021 at 5:42 PM JING ZHANG <
beyond1...@gmail.com
wrote:
Thanks Xintong for giving detailed documentation.

The best practice for handling test failure is very
detailed,
it's
a
good
guidelines document with clear action steps.

+1 to Xintong's proposal.

Xintong Song <tonysong...@gmail.com> 于2021年6月28日周一
下午4:07写道:
Thanks all for the discussion.

Based on the opinions so far, I've drafted the new
guidelines
[1],
as a
potential replacement of the original wiki page [2].

Hopefully this draft has covered the most opinions
discussed
and
consensus
made in this discussion thread.

Looking forward to your feedback.

Thank you~

Xintong Song


[1]


https://docs.google.com/document/d/1uUbxbgbGErBXtmEjhwVhBWG3i6nhQ0LXs96OlntEYnU/edit?usp=sharing
[2]

https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests


On Fri, Jun 25, 2021 at 10:40 PM Piotr Nowojski <
pnowoj...@apache.org>
wrote:

Thanks for the clarification Till. +1 for what you
have
written.
Piotrek

pt., 25 cze 2021 o 16:00 Till Rohrmann <
trohrm...@apache.org
napisał(a):
One quick note for clarification. I don't have
anything
against
builds
running on your personal Azure account and this is
not
what I
understood
under "local environment". For me "local
environment"
means
that
someone
runs the test locally on his machine and then says
that
the
tests have passed locally.

I do agree that there might be a conflict of
interests
if a
PR
author
disables tests. Here I would argue that we don't
have
malignant
committers
which means that every committer will probably first
check
the
respective
ticket for how often the test failed. Then I guess
the
next
step
would
be
to discuss on the ticket whether to disable it or
not.
And
finally,
after
reaching a consensus, it will be disabled. If we see
someone
abusing
this
policy, then we can still think about how to guard
against
it.
But,
honestly, I have very rarely seen such a case. I am
also
ok
to
pull in
the
release manager to make the final call if this
resolves
concerns.
Cheers,
Till

On Fri, Jun 25, 2021 at 9:07 AM Piotr Nowojski <
pnowoj...@apache.org>
wrote:

+1 for the general idea, however I have concerns
about
a
couple
of
details.
I would first try to not introduce the exception
for
local
builds.
It makes it quite hard for others to verify the
build
and
to
make
sure
that the right things were executed.

I would counter Till's proposal to ignore local
green
builds.
If
committer
is merging and closing a PR, with official azure
failure,
but
there
was a
green build before or in local azure it's IMO
enough
to
leave
the
message:
Latest build failure is a known issue:
FLINK-12345
Green local build: URL
This should address Till's concern about
verification.
On the other hand I have concerns about disabling
tests.*
It
shouldn't
be
the PR author/committer that's disabling a test on
his
own,
as
that's a
conflict of interests*. I have however no problems
with
disabling
test
instabilities that were marked as "blockers"
though,
that
should
work
pretty well. But the important thing here is to
correctly
judge
bumping
priorities of test instabilities based on their
frequency
and
current
general health of the system. I believe that
release
managers
should
be
playing a big role here in deciding on the
guidelines
of
what
should
be a
priority of certain test instabilities.

What I mean by that is two example scenarios:
1. if we have a handful of very frequently failing
tests
and
a
handful
of
very rarely failing tests (like one reported
failure
and
no
another
occurrence in many months, and let's even say that
the
failure
looks
like
infrastructure/network timeout), we should focus
on
the
frequently
failing
ones, and probably we are safe to ignore for the
time
being
the
rare
issues
- at least until we deal with the most pressing
ones.
2. If we have tons of rarely failing test
instabilities,
we
should
probably
start addressing them as blockers.

I'm using my own conscious and my best judgement
when
I'm
bumping/decreasing priorities of test
instabilities
(and
bugs),
but
as
individual committer I don't have the full
picture.
As
I
wrote
above, I
think release managers are in a much better
position
to
keep
adjusting
those kind of guidelines.

Best, Piotrek

pt., 25 cze 2021 o 08:10 Yu Li <car...@gmail.com>
napisał(a):
+1 for Xintong's proposal.

For me, resolving problems directly (fixing the
infrastructure
issue,
disabling unstable tests and creating blocker
JIRAs
to
track
the
fix
and
re-enable them asap, etc.) is (in most cases)
better
than
working
around
them (verify locally, manually check and judge
the
failure
as
"unrelated",
etc.), and I believe the proposal could help us
pushing
those
more
"real"
solutions forward.

Best Regards,
Yu


On Fri, 25 Jun 2021 at 10:58, Yangze Guo <
karma...@gmail.com
wrote:
Creating a blocker issue for the manually
disabled
tests
sounds
good
to
me.
Minor: I'm still a bit worried about the
commits
merged
before we
fix
the unstable tests can also break those tests.
Instead
of
letting
the
assigners keep a look at all potentially
related
commits,
they
can
maintain a branch that is periodically synced
with
the
master
branch
while enabling the unstable test. So that they
can
catch
the
breaking
changes asap.

Best,
Yangze Guo

On Thu, Jun 24, 2021 at 9:52 PM Till Rohrmann
<
trohrm...@apache.org>
wrote:
I like the idea of creating a blocker issue
for a
disabled
test.
This
will
force us to resolve it in a timely manner
and
it
won't
fall
through
the
cracks.

Cheers,
Till

On Thu, Jun 24, 2021 at 8:06 AM Jingsong Li
<
jingsongl...@gmail.com>
wrote:
+1 to Xintong's proposal

I also have some concerns about unstable
cases.
I think unstable cases can be divided into
these
types:
- Force majeure: For example, network
timeout,
sudden
environmental
collapse, they are accidental and can
always
be
solved
by
triggering
azure
again. Committers should wait for the next
green
azure.
- Obvious mistakes: For example, some
errors
caused
by
obvious
reasons
may
be repaired quickly. At this time, do we
need
to
wait,
or not
wait
and
just
ignore?

- Difficult questions: These problems are
very
difficult
to
find.
There
will be no solution for a while and a
half.
We
don't
even
know
the
reason.
At this time, we should ignore it. (Maybe
it's
judged
by
the
author
of
the
case. But what about the old case whose
author
can't
be
found?)
So, the ignored cases should be the block
of
the
next
release
until
the
reason is found or the case is fixed?  We
need
to
ensure
that
someone
will
take care of these cases, because there
is no
deepening
of
failed
tests, no
one may continue to pay attention to these
cases.
I think this guideline should consider
these
situations,
and
show
how
to
solve them.

Best,
Jingsong

On Thu, Jun 24, 2021 at 10:57 AM Jark Wu <
imj...@gmail.com>
wrote:
Thanks to Xintong for bringing up this
topic,
I'm
+1
in
general.
However, I think it's still not very
clear
how
we
address
the
unstable
tests.
I think this is a very important part of
this
new
guideline.
According to the discussion above, if
some
tests
are
unstable,
we
can
manually disable it.
But I have some questions in my mind:
1) Is the instability judged by the
committer
themselves or
by
some
metrics?
2) Should we log the disable commit in
the
corresponding
issue
and
increase
the priority?
3) What if nobody looks into this issue
and
this
becomes
some
potential
bugs released with the new version?
4) If no person is actively working on
the
issue,
who
should
re-enable
it?
Would it block PRs again?


Best,
Jark


On Thu, 24 Jun 2021 at 10:04, Xintong
Song
<
tonysong...@gmail.com>
wrote:
Thanks all for the feedback.

@Till @Yangze

I'm also not convinced by the idea of
having
an
exception
for
local
builds.
We need to execute the entire build
(or
at
least
the
failing
stage)
locally, to make sure subsequent test
cases
prevented by
the
failure
one
are all executed. In that case, it's
probably
easier
to
rerun
the
build
on
azure than locally.

Concerning disabling unstable test
cases
that
regularly
block
PRs
from
merging, maybe we can say that such
cases
can
only
be
disabled
when
someone
is actively looking into it, likely
the
person
who
disabled
the
case.
If
this person is no longer actively
working
on
it,
he/she
should
enable
the
case again no matter if it is fixed or
not.
@Jing

Thanks for the suggestions.

+1 to provide guidelines on handling
test
failures.
1. Report the test failures in the
JIRA.
+1 on this. Currently, the release
managers
are
monitoring
the
ci
and
cron
build instabilities and reporting
them on
JIRA.
We
should
also
encourage
other contributors to do that for PRs.

2. Set a deadline to find out the root
cause
and
solve
the
failure
for
the
new created JIRA  because we could
not
block
other
commit
merges
for
a
long
time

3. What to do if the JIRA has not made
significant
progress
when
reached
to
the deadline time?

I'm not sure about these two. It
feels a
bit
against
the
voluntary
nature
of open source projects.

IMHO, frequent instabilities are more
likely
to
be
upgraded
to
the
critical
/ blocker priority, receive more
attention
and
eventually
get
fixed.
Release managers are also responsible
for
looking
for
assignees
for
such
issues. If a case is still not fixed
soonish,
even
with
all
these
efforts,
I'm not sure how setting a deadline
can
help
this.
4. If we disable the respective tests
temporarily,
we
also
need a
mechanism
to ensure the issue would be
continued
to
be
investigated
in
the
future.
+1. As mentioned above, we may
consider
disabling
such
tests
iff
someone
is
actively working on it.

Thank you~

Xintong Song



On Wed, Jun 23, 2021 at 9:56 PM JING
ZHANG
<
beyond1...@gmail.com
wrote:
Hi Xintong,
+1 to the proposal.
In order to better comply with the
rule,
it
is
necessary
to
describe
what's
best practice if encountering test
failure
which
seems
unrelated
with
the
current commits.
How to avoid merging PR with test
failures
and
not
blocking
code
merging
for a long time?
I tried to think about the possible
steps,
and
found
there
are
some
detailed problems that need to be
discussed
in
a
step
further:
1. Report the test failures in the
JIRA.
2. Set a deadline to find out the
root
cause
and
solve
the
failure
for
the
new created JIRA  because we could
not
block
other
commit
merges
for
a
long
time
     When is a reasonable deadline
here?
3. What to do if the JIRA has not
made
significant
progress
when
reached
to
the deadline time?
     There are several situations as
follows,
maybe
different
cases
need
different approaches.
     1. the JIRA is non-assigned yet
     2. not found the root cause yet
     3. not found a good solution,
but
already
found the
root
cause
     4. found a solution, but it
needs
more
time
to
be
done.
4. If we disable the respective
tests
temporarily,
we
also
need a
mechanism
to ensure the issue would be
continued
to
be
investigated
in
the
future.
Best regards,
JING ZHANG

Stephan Ewen <se...@apache.org>
于2021年6月23日周三
下午8:16写道:
+1 to Xintong's proposal

On Wed, Jun 23, 2021 at 1:53 PM
Till
Rohrmann <
trohrm...@apache.org>
wrote:

I would first try to not
introduce
the
exception
for
local
builds.
It
makes
it quite hard for others to
verify
the
build
and to
make
sure
that
the
right things were executed. If
we
see
that
this
becomes
an
issue
then
we
can revisit this idea.

Cheers,
Till

On Wed, Jun 23, 2021 at 4:19 AM
Yangze
Guo
<
karma...@gmail.com>
wrote:
+1 for appending this to
community
guidelines for
merging
PRs.
@Till Rohrmann
I agree that with this
approach
unstable
tests
will
not
block
other
commit merges. However, it
might
be
hard
to
prevent
merging
commits
that are related to those
tests
and
should
have
been
passed
them.
It's
true that this judgment can be
made
by
the
committers,
but
no
one
can
ensure the judgment is always
precise
and
so
that
we
have
this
discussion thread.

Regarding the unstable tests,
how
about
adding
another
exception:
committers verify it in their
local
environment
and
comment in
such
cases?

Best,
Yangze Guo

On Tue, Jun 22, 2021 at 8:23
PM
刘建刚 <
liujiangangp...@gmail.com
wrote:
It is a good principle to
run
all
tests
successfully
with any
change.
This
means a lot for project's
stability
and
development.
I
am big
+1
for
this
proposal.

Best
liujiangang

Till Rohrmann <
trohrm...@apache.org>
于2021年6月22日周二
下午6:36写道:
One way to address the
problem
of
regularly
failing
tests
that
block
merging of PRs is to
disable
the
respective
tests
for
the
time
being.
Of
course, the failing test
then
needs
to
be
fixed.
But
at
least
that
way
we
would not block everyone
from
making
progress.
Cheers,
Till

On Tue, Jun 22, 2021 at
12:00
PM
Arvid
Heise
<
ar...@apache.org
wrote:
I think this is overall
a
good
idea.
So +1
from
my
side.
However, I'd like to
put a
higher
priority
on
infrastructure
then,
in
particular docker
image/artifact
caches.
On Tue, Jun 22, 2021 at
11:50
AM
Till
Rohrmann
<
trohrm...@apache.org
wrote:

Thanks for bringing
this
topic
to
our
attention
Xintong.
I
think
your
proposal makes a lot
of
sense
and
we
should
follow
it.
It
will
give us
confidence that our
changes
are
working
and
it
might
be a
good
incentive
to
quickly fix build
instabilities.
Hence,
+1.
Cheers,
Till

On Tue, Jun 22, 2021
at
11:12
AM
Xintong
Song <
tonysong...@gmail.com>
wrote:

Hi everyone,

In the past a
couple of
weeks,
I've
observed
several
times
that
PRs
are
merged without a
green
light
from
the
CI
tests,
where
failure
cases
are
considered
*unrelated*.
This
may
not
always
cause
problems,
but
would
increase the chance
of
breaking
our
code
base.
In
fact,
it
has
occurred
to
me twice in the past
few
weeks
that I
had
to
revert a
commit
which
breaks
the master branch
due
to
this.
I think it would be
nicer
to
enforce a
stricter
rule,
that
no
PRs
should
be
merged without
passing
CI.
The problems of
merging
PRs
with
"unrelated"
test
failures
are:
- It's not always
straightforward
to
tell
whether a
test
failures are
related or not.
- It prevents
subsequent
test
cases
from
being
executed,
which
may
fail
relating to the PR
changes.
To make things
easier
for
the
committers,
the
following
exceptions
might
be
considered
acceptable.
- The PR has passed
CI
in
the
contributor's
personal
workspace.
Please
post
the link in such
cases.
- The CI tests have
been
triggered
multiple
times, on
the
same
commit,
and
each stage has at
least
passed
for
once.
Please
also
comment
in
such
cases.
If we all agree on
this,
I'd
update the
community
guidelines
for
merging
PRs wrt. this
proposal.
[1]
Please let me know
what
do
you
think.
Thank you~

Xintong Song


[1]

https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests

--
Best, Jingsong Lee


--
Best, Jingsong Lee


--
Best regards!
Rui Li

Reply via email to