Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-12-06 Thread Maximilian Michels
        >          >     Will give it another look tomorrow.
 >          >          >
 >          >          >     Thanks,
 >          >          >     Max
 >          >          >
 >          >          >     On 22.11.18 13:07, Maximilian Michels wrote:
 >          >          >      > Hi Alex,
 >          >          >      >
 >          >          >      > The test seems to have gotten flaky after we
merged
 >          >         support for
 >          >          >     portable
 >          >          >      > timers in Flink's batch mode.
 >          >          >      >
 >          >          >      > Looking into this now.
 >          >          >      >
 >          >          >      > Thanks,
 >          >          >      > Max
 >          >          >      >
 >          >          >      > On 21.11.18 23:56, Alex Amato wrote:
 >          >          >      >> Hello, I have noticed
 >          >          >      >>
 >          >   
  that org.apache.beam.runners.flink.PortableTimersExecutionTest

 >          >          >     is very
 >          >          >      >> flakey, and repro'd this test timeout on
the master
 >          >         branch in
 >          >          >     40/50 runs.
 >          >          >      >>
 >          >          >      >> I filed a JIRA issue: BEAM-6111
 >          >          >      >>
<https://issues.apache.org/jira/browse/BEAM-6111>. I
 >          >         was just
 >          >          >      >> wondering if anyone knew why this may be
occurring,
 >          >         and to check if
 >          >          >      >> anyone else has been experiencing this.
 >          >          >      >>
 >          >          >      >> Thanks,
 >          >          >      >> Alex
 >          >          >
 >          >
 >



Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-12-05 Thread Alex Amato
 1000 test runs without a single failure.
> >  >
> >  > The EmbeddedEnvironment uses direct channels which
> are marked
> >  > experimental in GRPC. We may have to convert them to
> regular
> > socket
> >  > communication.
> >  >
> >  > 2) Try setting a conditional breakpoint in
> GrpcStateService
> >  > which will
> >  > never break, e.g. "false". Set it here:
> >  >
> >
> https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134
> >  >
> >  > The tests will never fail. The SDK harness is always
> shutdown
> >  > correctly
> >  > at the end of the test.
> >  >
> >  > Thanks,
> >  > Max
> >  >
> >  > On 26.11.18 19:15, Alex Amato wrote:
> >  >  > Thanks Maximilian, let me know if you need any
> help. Usually
> >  > I debug
> >  >  > this sort of thing by pausing the IntelliJ
> debugger to see
> >  > all the
> >  >  > different threads which are waiting on various
> conditions. If
> >  > you find
> >  >  > any insights from that, please post them here and
> we can try
> >  > to figure
> >  >  > out the source of the stuckness. Perhaps it may be
> some
> >  > concurrency
> >  >  > issue leading to deadlock?
> >  >  >
> >  >  > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels
> >  > mailto:m...@apache.org>
> > <mailto:m...@apache.org <mailto:m...@apache.org>>
> >  >  > <mailto:m...@apache.org <mailto:m...@apache.org>
> > <mailto:m...@apache.org <mailto:m...@apache.org>>>> wrote:
> >  >  >
> >  >  > I couldn't fix it thus far. The issue does not
> seem to be
> >  > in the Flink
> >  >  > Runner but in the way the tests utilizes the
> EMBEDDED
> >  > environment to
> >  >  >     run
> >  >  > multiple portable jobs in a row.
> >  >  >
> >  >  > When it gets stuck it is in RemoteBundle#close
> and it is
> >  > independent of
> >  >  > the test type (batch and streaming have
> different
> >  > implementations).
> >  >  >
> >  >  > Will give it another look tomorrow.
> >  >  >
> >  >  > Thanks,
> >  >  > Max
> >  >  >
> >  >  > On 22.11.18 13:07, Maximilian Michels wrote:
> >  >  >  > Hi Alex,
> >  >  >  >
> >  >  >  > The test seems to have gotten flaky after
> we merged
> >  > support for
> >  >  > portable
> >  >  >  > timers in Flink's batch mode.
> >  >  >  >
> >  >  >  > Looking into this now.
> >  >  >  >
> >  >  >  > Thanks,
> >  >  >  > Max
> >  >  >  >
> >  >  >  > On 21.11.18 23:56, Alex Amato wrote:
> >  >  >  >> Hello, I have noticed
> >  >  >  >>
> >  >
>  that org.apache.beam.runners.flink.PortableTimersExecutionTest
> >  >  > is very
> >  >  >  >> flakey, and repro'd this test timeout on
> the master
> >  > branch in
> >  >  > 40/50 runs.
> >  >  >  >>
> >  >  >  >> I filed a JIRA issue: BEAM-6111
> >  >  >  >> <
> https://issues.apache.org/jira/browse/BEAM-6111>. I
> >  > was just
> >  >  >  >> wondering if anyone knew why this may be
> occurring,
> >  > and to check if
> >  >  >  >> anyone else has been experiencing this.
> >  >  >  >>
> >  >  >  >> Thanks,
> >  >  >  >> Alex
> >  >  >
> >  >
> >
>


Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-12-05 Thread Maximilian Michels
ted to how the cleanup is
 >         performed
 >         for the EmbeddedEnvironmentFactory. The environment is 
shutdown
 >         when the
 >         SDK Harness exists but the GRPC threads continue to linger 
for
 >         some time
 >         and may stall state processing on the next test.
 >
 >         If you do _not_ close DefaultJobBundleFactory, which happens
during
 >         close() or dispose() in the FlinkExecutableStageFunction or
 >         ExecutableStageDoFnOperator respectively, the tests run just
 >         fine. I ran
 >         1000 test runs without a single failure.
 >
 >         The EmbeddedEnvironment uses direct channels which are marked
 >         experimental in GRPC. We may have to convert them to regular
socket
 >         communication.
 >
 >         2) Try setting a conditional breakpoint in GrpcStateService
 >         which will
 >         never break, e.g. "false". Set it here:
 >

https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134
 >
 >         The tests will never fail. The SDK harness is always shutdown
 >         correctly
 >         at the end of the test.
 >
 >         Thanks,
 >         Max
 >
 >         On 26.11.18 19:15, Alex Amato wrote:
 >          > Thanks Maximilian, let me know if you need any help. 
Usually
 >         I debug
 >          > this sort of thing by pausing the IntelliJ debugger to see
 >         all the
 >          > different threads which are waiting on various 
conditions. If
 >         you find
 >          > any insights from that, please post them here and we can 
try
 >         to figure
 >          > out the source of the stuckness. Perhaps it may be some
 >         concurrency
 >          > issue leading to deadlock?
 >          >
 >          > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels
 >         mailto:m...@apache.org>
<mailto:m...@apache.org <mailto:m...@apache.org>>
 >          > <mailto:m...@apache.org <mailto:m...@apache.org>
<mailto:m...@apache.org <mailto:m...@apache.org>>>> wrote:
 >          >
 >          >     I couldn't fix it thus far. The issue does not seem 
to be
 >         in the Flink
 >          >     Runner but in the way the tests utilizes the EMBEDDED
 >         environment to
 >          >     run
 >          >     multiple portable jobs in a row.
 >          >
 >          >     When it gets stuck it is in RemoteBundle#close and it 
is
 >         independent of
 >          >     the test type (batch and streaming have different
 >         implementations).
 >          >
 >          >     Will give it another look tomorrow.
 >          >
 >          >     Thanks,
 >          >     Max
 >          >
 >          >     On 22.11.18 13:07, Maximilian Michels wrote:
 >          >      > Hi Alex,
 >          >      >
 >          >      > The test seems to have gotten flaky after we merged
 >         support for
 >          >     portable
 >          >      > timers in Flink's batch mode.
 >          >      >
 >          >      > Looking into this now.
 >          >      >
 >          >      > Thanks,
 >          >      > Max
 >          >      >
 >          >      > On 21.11.18 23:56, Alex Amato wrote:
 >          >      >> Hello, I have noticed
 >          >      >>
 >         that 
org.apache.beam.runners.flink.PortableTimersExecutionTest
 >          >     is very
 >          >      >> flakey, and repro'd this test timeout on the 
master
 >         branch in
 >          >     40/50 runs.
 >          >      >>
 >          >      >> I filed a JIRA issue: BEAM-6111
 >          >      >> 
<https://issues.apache.org/jira/browse/BEAM-6111>. I
 >         was just
 >          >      >> wondering if anyone knew why this may be 
occurring,
 >         and to check if
 >          >      >> anyone else has been experiencing this.
 >          >      >>
 >          >      >> Thanks,
 >          >      >> Alex
 >          >
 >



Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-12-04 Thread Alex Amato
orrectly
>> > at the end of the test.
>> >
>> > Thanks,
>> > Max
>> >
>> > On 26.11.18 19:15, Alex Amato wrote:
>> >  > Thanks Maximilian, let me know if you need any help. Usually
>> > I debug
>> >  > this sort of thing by pausing the IntelliJ debugger to see
>> > all the
>> >  > different threads which are waiting on various conditions. If
>> > you find
>> >  > any insights from that, please post them here and we can try
>> > to figure
>> >  > out the source of the stuckness. Perhaps it may be some
>> > concurrency
>> >  > issue leading to deadlock?
>> >  >
>> >  > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels
>> > mailto:m...@apache.org>
>> >  > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
>> >  >
>> >  > I couldn't fix it thus far. The issue does not seem to be
>> > in the Flink
>> >  > Runner but in the way the tests utilizes the EMBEDDED
>> > environment to
>> >  > run
>> >  > multiple portable jobs in a row.
>> >  >
>> >  > When it gets stuck it is in RemoteBundle#close and it is
>> >     independent of
>> >  > the test type (batch and streaming have different
>> > implementations).
>> >  >
>> >  > Will give it another look tomorrow.
>> >  >
>> >  > Thanks,
>> >  > Max
>> >  >
>> >  > On 22.11.18 13:07, Maximilian Michels wrote:
>> >  >  > Hi Alex,
>> >  >  >
>> >  >  > The test seems to have gotten flaky after we merged
>> > support for
>> >  > portable
>> >  >  > timers in Flink's batch mode.
>> >  >  >
>> >  >  > Looking into this now.
>> >  >  >
>> >  >  > Thanks,
>> >  >  > Max
>> >  >  >
>> >  >  > On 21.11.18 23:56, Alex Amato wrote:
>> >  >  >> Hello, I have noticed
>> >  >  >>
>> > that org.apache.beam.runners.flink.PortableTimersExecutionTest
>> >  > is very
>> >  >  >> flakey, and repro'd this test timeout on the master
>> > branch in
>> >  > 40/50 runs.
>> >  >  >>
>> >  >  >> I filed a JIRA issue: BEAM-6111
>> >  >  >> <https://issues.apache.org/jira/browse/BEAM-6111>. I
>> > was just
>> >  >  >> wondering if anyone knew why this may be occurring,
>> > and to check if
>> >  >  >> anyone else has been experiencing this.
>> >  >  >>
>> >  >  >> Thanks,
>> >  >  >> Alex
>> >  >
>> >
>>
>


Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-12-04 Thread Alex Amato
> I couldn't fix it thus far. The issue does not seem to be
> > in the Flink
> >  > Runner but in the way the tests utilizes the EMBEDDED
> > environment to
> >  > run
> >  > multiple portable jobs in a row.
> >  >
> >  > When it gets stuck it is in RemoteBundle#close and it is
> > independent of
> >  > the test type (batch and streaming have different
> > implementations).
> >  >
> >  > Will give it another look tomorrow.
> >  >
> >  > Thanks,
> >  > Max
> >  >
> >  >     On 22.11.18 13:07, Maximilian Michels wrote:
> >  >  > Hi Alex,
> >  >  >
> >  >  > The test seems to have gotten flaky after we merged
> > support for
> >  > portable
> >  >  > timers in Flink's batch mode.
> >  >  >
> >  >  > Looking into this now.
> >  >  >
> >  >  > Thanks,
> >  >  > Max
> >  >  >
> >  >  > On 21.11.18 23:56, Alex Amato wrote:
> >  >  >> Hello, I have noticed
> >  >  >>
> > that org.apache.beam.runners.flink.PortableTimersExecutionTest
> >  > is very
> >  >  >> flakey, and repro'd this test timeout on the master
> > branch in
> >  > 40/50 runs.
> >  >  >>
> >  >  >> I filed a JIRA issue: BEAM-6111
> >  >  >> <https://issues.apache.org/jira/browse/BEAM-6111>. I
> > was just
> >  >  >> wondering if anyone knew why this may be occurring,
> > and to check if
> >  >  >> anyone else has been experiencing this.
> >  >  >>
> >  >  >> Thanks,
> >  >  >> Alex
> >  >
> >
>


Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-11-30 Thread Maximilian Michels
This turned out to be a tricky bug. Robert and me had a joined debugging 
session and managed to find the culprit.


PR pending: https://github.com/apache/beam/pull/7171

On 27.11.18 19:35, Kenneth Knowles wrote:
I actually didn't look at this one. I filed a bunch more adjacent flake 
bugs. I didn't find your bug but I do see that test flaking at the same 
time as the others. FWIW here is the list of flakes and sickbayed tests: 
https://issues.apache.org/jira/issues/?filter=12343195


Kenn

On Tue, Nov 27, 2018 at 10:25 AM Alex Amato <mailto:ajam...@google.com>> wrote:


+Ken,

Did you happen to look into this test? I heard that you may have
been looking into this.

On Mon, Nov 26, 2018 at 3:36 PM Maximilian Michels mailto:m...@apache.org>> wrote:

Hi Alex,

Thanks for your help! I'm quite used to debugging
concurrent/distributed
problems. But this one is quite tricky, especially with regards
to GRPC
threads. I try to provide more information in the following.

There are two observations:

1) The problem is specifically related to how the cleanup is
performed
for the EmbeddedEnvironmentFactory. The environment is shutdown
when the
SDK Harness exists but the GRPC threads continue to linger for
some time
and may stall state processing on the next test.

If you do _not_ close DefaultJobBundleFactory, which happens during
close() or dispose() in the FlinkExecutableStageFunction or
ExecutableStageDoFnOperator respectively, the tests run just
fine. I ran
1000 test runs without a single failure.

The EmbeddedEnvironment uses direct channels which are marked
experimental in GRPC. We may have to convert them to regular socket
communication.

2) Try setting a conditional breakpoint in GrpcStateService
which will
never break, e.g. "false". Set it here:

https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134

The tests will never fail. The SDK harness is always shutdown
correctly
at the end of the test.

Thanks,
Max

On 26.11.18 19:15, Alex Amato wrote:
 > Thanks Maximilian, let me know if you need any help. Usually
I debug
 > this sort of thing by pausing the IntelliJ debugger to see
all the
 > different threads which are waiting on various conditions. If
you find
 > any insights from that, please post them here and we can try
to figure
 > out the source of the stuckness. Perhaps it may be some
concurrency
 > issue leading to deadlock?
 >
 > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels
mailto:m...@apache.org>
 > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
 >
 >     I couldn't fix it thus far. The issue does not seem to be
in the Flink
 >     Runner but in the way the tests utilizes the EMBEDDED
environment to
 >     run
 >     multiple portable jobs in a row.
 >
 >     When it gets stuck it is in RemoteBundle#close and it is
independent of
 >     the test type (batch and streaming have different
implementations).
 >
 >     Will give it another look tomorrow.
 >
 >     Thanks,
 >     Max
 >
 >     On 22.11.18 13:07, Maximilian Michels wrote:
 >      > Hi Alex,
 >      >
 >      > The test seems to have gotten flaky after we merged
support for
 >     portable
 >      > timers in Flink's batch mode.
 >      >
 >      > Looking into this now.
 >      >
 >      > Thanks,
     >      > Max
 >      >
     >      > On 21.11.18 23:56, Alex Amato wrote:
 >      >> Hello, I have noticed
 >      >>
that org.apache.beam.runners.flink.PortableTimersExecutionTest
 >     is very
 >      >> flakey, and repro'd this test timeout on the master
branch in
 >     40/50 runs.
 >      >>
 >      >> I filed a JIRA issue: BEAM-6111
 >      >> <https://issues.apache.org/jira/browse/BEAM-6111>. I
was just
 >      >> wondering if anyone knew why this may be occurring,
and to check if
 >      >> anyone else has been experiencing this.
 >      >>
 >      >> Thanks,
 >      >> Alex
 >



Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-11-27 Thread Kenneth Knowles
I actually didn't look at this one. I filed a bunch more adjacent flake
bugs. I didn't find your bug but I do see that test flaking at the same
time as the others. FWIW here is the list of flakes and sickbayed tests:
https://issues.apache.org/jira/issues/?filter=12343195

Kenn

On Tue, Nov 27, 2018 at 10:25 AM Alex Amato  wrote:

> +Ken,
>
> Did you happen to look into this test? I heard that you may have been
> looking into this.
>
> On Mon, Nov 26, 2018 at 3:36 PM Maximilian Michels  wrote:
>
>> Hi Alex,
>>
>> Thanks for your help! I'm quite used to debugging concurrent/distributed
>> problems. But this one is quite tricky, especially with regards to GRPC
>> threads. I try to provide more information in the following.
>>
>> There are two observations:
>>
>> 1) The problem is specifically related to how the cleanup is performed
>> for the EmbeddedEnvironmentFactory. The environment is shutdown when the
>> SDK Harness exists but the GRPC threads continue to linger for some time
>> and may stall state processing on the next test.
>>
>> If you do _not_ close DefaultJobBundleFactory, which happens during
>> close() or dispose() in the FlinkExecutableStageFunction or
>> ExecutableStageDoFnOperator respectively, the tests run just fine. I ran
>> 1000 test runs without a single failure.
>>
>> The EmbeddedEnvironment uses direct channels which are marked
>> experimental in GRPC. We may have to convert them to regular socket
>> communication.
>>
>> 2) Try setting a conditional breakpoint in GrpcStateService which will
>> never break, e.g. "false". Set it here:
>>
>> https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134
>>
>> The tests will never fail. The SDK harness is always shutdown correctly
>> at the end of the test.
>>
>> Thanks,
>> Max
>>
>> On 26.11.18 19:15, Alex Amato wrote:
>> > Thanks Maximilian, let me know if you need any help. Usually I debug
>> > this sort of thing by pausing the IntelliJ debugger to see all the
>> > different threads which are waiting on various conditions. If you find
>> > any insights from that, please post them here and we can try to figure
>> > out the source of the stuckness. Perhaps it may be some concurrency
>> > issue leading to deadlock?
>> >
>> > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels > > <mailto:m...@apache.org>> wrote:
>> >
>> > I couldn't fix it thus far. The issue does not seem to be in the
>> Flink
>> > Runner but in the way the tests utilizes the EMBEDDED environment to
>> > run
>> > multiple portable jobs in a row.
>> >
>> > When it gets stuck it is in RemoteBundle#close and it is
>> independent of
>> > the test type (batch and streaming have different implementations).
>> >
>> > Will give it another look tomorrow.
>> >
>> > Thanks,
>> > Max
>> >
>> > On 22.11.18 13:07, Maximilian Michels wrote:
>> >  > Hi Alex,
>> >  >
>> >  > The test seems to have gotten flaky after we merged support for
>> > portable
>> >  > timers in Flink's batch mode.
>> >  >
>> >  > Looking into this now.
>> >  >
>> >  > Thanks,
>> >  > Max
>> >  >
>> >  > On 21.11.18 23:56, Alex Amato wrote:
>> >  >> Hello, I have noticed
>> >  >> that org.apache.beam.runners.flink.PortableTimersExecutionTest
>> > is very
>> >  >> flakey, and repro'd this test timeout on the master branch in
>> > 40/50 runs.
>> >  >>
>> >  >> I filed a JIRA issue: BEAM-6111
>> >  >> <https://issues.apache.org/jira/browse/BEAM-6111>. I was just
>> >  >> wondering if anyone knew why this may be occurring, and to
>> check if
>> >  >> anyone else has been experiencing this.
>> >  >>
>> >  >> Thanks,
>> >  >> Alex
>> >
>>
>


Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-11-27 Thread Alex Amato
+Ken,

Did you happen to look into this test? I heard that you may have been
looking into this.

On Mon, Nov 26, 2018 at 3:36 PM Maximilian Michels  wrote:

> Hi Alex,
>
> Thanks for your help! I'm quite used to debugging concurrent/distributed
> problems. But this one is quite tricky, especially with regards to GRPC
> threads. I try to provide more information in the following.
>
> There are two observations:
>
> 1) The problem is specifically related to how the cleanup is performed
> for the EmbeddedEnvironmentFactory. The environment is shutdown when the
> SDK Harness exists but the GRPC threads continue to linger for some time
> and may stall state processing on the next test.
>
> If you do _not_ close DefaultJobBundleFactory, which happens during
> close() or dispose() in the FlinkExecutableStageFunction or
> ExecutableStageDoFnOperator respectively, the tests run just fine. I ran
> 1000 test runs without a single failure.
>
> The EmbeddedEnvironment uses direct channels which are marked
> experimental in GRPC. We may have to convert them to regular socket
> communication.
>
> 2) Try setting a conditional breakpoint in GrpcStateService which will
> never break, e.g. "false". Set it here:
>
> https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134
>
> The tests will never fail. The SDK harness is always shutdown correctly
> at the end of the test.
>
> Thanks,
> Max
>
> On 26.11.18 19:15, Alex Amato wrote:
> > Thanks Maximilian, let me know if you need any help. Usually I debug
> > this sort of thing by pausing the IntelliJ debugger to see all the
> > different threads which are waiting on various conditions. If you find
> > any insights from that, please post them here and we can try to figure
> > out the source of the stuckness. Perhaps it may be some concurrency
> > issue leading to deadlock?
> >
> > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels  > <mailto:m...@apache.org>> wrote:
> >
> > I couldn't fix it thus far. The issue does not seem to be in the
> Flink
> > Runner but in the way the tests utilizes the EMBEDDED environment to
> > run
> > multiple portable jobs in a row.
> >
> > When it gets stuck it is in RemoteBundle#close and it is independent
> of
> > the test type (batch and streaming have different implementations).
> >
> > Will give it another look tomorrow.
> >
> > Thanks,
> > Max
> >
> > On 22.11.18 13:07, Maximilian Michels wrote:
> >  > Hi Alex,
> >  >
> >  > The test seems to have gotten flaky after we merged support for
> > portable
> >  > timers in Flink's batch mode.
> >  >
> >  > Looking into this now.
> >  >
> >  > Thanks,
> >  > Max
> >  >
> >  > On 21.11.18 23:56, Alex Amato wrote:
> >  >> Hello, I have noticed
> >  >> that org.apache.beam.runners.flink.PortableTimersExecutionTest
> > is very
> >  >> flakey, and repro'd this test timeout on the master branch in
> > 40/50 runs.
> >  >>
> >  >> I filed a JIRA issue: BEAM-6111
> >  >> <https://issues.apache.org/jira/browse/BEAM-6111>. I was just
> >  >> wondering if anyone knew why this may be occurring, and to check
> if
> >  >> anyone else has been experiencing this.
> >  >>
> >  >> Thanks,
> >  >> Alex
> >
>


Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-11-26 Thread Maximilian Michels

Hi Alex,

Thanks for your help! I'm quite used to debugging concurrent/distributed 
problems. But this one is quite tricky, especially with regards to GRPC 
threads. I try to provide more information in the following.


There are two observations:

1) The problem is specifically related to how the cleanup is performed 
for the EmbeddedEnvironmentFactory. The environment is shutdown when the 
SDK Harness exists but the GRPC threads continue to linger for some time 
and may stall state processing on the next test.


If you do _not_ close DefaultJobBundleFactory, which happens during 
close() or dispose() in the FlinkExecutableStageFunction or 
ExecutableStageDoFnOperator respectively, the tests run just fine. I ran 
1000 test runs without a single failure.


The EmbeddedEnvironment uses direct channels which are marked 
experimental in GRPC. We may have to convert them to regular socket 
communication.


2) Try setting a conditional breakpoint in GrpcStateService which will 
never break, e.g. "false". Set it here: 
https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134


The tests will never fail. The SDK harness is always shutdown correctly 
at the end of the test.


Thanks,
Max

On 26.11.18 19:15, Alex Amato wrote:
Thanks Maximilian, let me know if you need any help. Usually I debug 
this sort of thing by pausing the IntelliJ debugger to see all the 
different threads which are waiting on various conditions. If you find 
any insights from that, please post them here and we can try to figure 
out the source of the stuckness. Perhaps it may be some concurrency 
issue leading to deadlock?


On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels <mailto:m...@apache.org>> wrote:


I couldn't fix it thus far. The issue does not seem to be in the Flink
Runner but in the way the tests utilizes the EMBEDDED environment to
run
multiple portable jobs in a row.

When it gets stuck it is in RemoteBundle#close and it is independent of
the test type (batch and streaming have different implementations).

Will give it another look tomorrow.

Thanks,
Max

On 22.11.18 13:07, Maximilian Michels wrote:
 > Hi Alex,
 >
 > The test seems to have gotten flaky after we merged support for
portable
 > timers in Flink's batch mode.
 >
 > Looking into this now.
 >
 > Thanks,
 > Max
 >
 > On 21.11.18 23:56, Alex Amato wrote:
 >> Hello, I have noticed
 >> that org.apache.beam.runners.flink.PortableTimersExecutionTest
is very
 >> flakey, and repro'd this test timeout on the master branch in
40/50 runs.
 >>
 >> I filed a JIRA issue: BEAM-6111
 >> <https://issues.apache.org/jira/browse/BEAM-6111>. I was just
 >> wondering if anyone knew why this may be occurring, and to check if
 >> anyone else has been experiencing this.
 >>
 >> Thanks,
 >> Alex



Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-11-26 Thread Alex Amato
Thanks Maximilian, let me know if you need any help. Usually I debug this
sort of thing by pausing the IntelliJ debugger to see all the different
threads which are waiting on various conditions. If you find any insights
from that, please post them here and we can try to figure out the source of
the stuckness. Perhaps it may be some concurrency issue leading to deadlock?

On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels  wrote:

> I couldn't fix it thus far. The issue does not seem to be in the Flink
> Runner but in the way the tests utilizes the EMBEDDED environment to run
> multiple portable jobs in a row.
>
> When it gets stuck it is in RemoteBundle#close and it is independent of
> the test type (batch and streaming have different implementations).
>
> Will give it another look tomorrow.
>
> Thanks,
> Max
>
> On 22.11.18 13:07, Maximilian Michels wrote:
> > Hi Alex,
> >
> > The test seems to have gotten flaky after we merged support for portable
> > timers in Flink's batch mode.
> >
> > Looking into this now.
> >
> > Thanks,
> > Max
> >
> > On 21.11.18 23:56, Alex Amato wrote:
> >> Hello, I have noticed
> >> that org.apache.beam.runners.flink.PortableTimersExecutionTest is very
> >> flakey, and repro'd this test timeout on the master branch in 40/50
> runs.
> >>
> >> I filed a JIRA issue: BEAM-6111
> >> <https://issues.apache.org/jira/browse/BEAM-6111>. I was just
> >> wondering if anyone knew why this may be occurring, and to check if
> >> anyone else has been experiencing this.
> >>
> >> Thanks,
> >> Alex
>


Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-11-22 Thread Maximilian Michels
I couldn't fix it thus far. The issue does not seem to be in the Flink 
Runner but in the way the tests utilizes the EMBEDDED environment to run 
multiple portable jobs in a row.


When it gets stuck it is in RemoteBundle#close and it is independent of 
the test type (batch and streaming have different implementations).


Will give it another look tomorrow.

Thanks,
Max

On 22.11.18 13:07, Maximilian Michels wrote:

Hi Alex,

The test seems to have gotten flaky after we merged support for portable 
timers in Flink's batch mode.


Looking into this now.

Thanks,
Max

On 21.11.18 23:56, Alex Amato wrote:
Hello, I have noticed 
that org.apache.beam.runners.flink.PortableTimersExecutionTest is very 
flakey, and repro'd this test timeout on the master branch in 40/50 runs.


I filed a JIRA issue: BEAM-6111 
<https://issues.apache.org/jira/browse/BEAM-6111>. I was just 
wondering if anyone knew why this may be occurring, and to check if 
anyone else has been experiencing this.


Thanks,
Alex


Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-11-22 Thread Maximilian Michels

Hi Alex,

The test seems to have gotten flaky after we merged support for portable 
timers in Flink's batch mode.


Looking into this now.

Thanks,
Max

On 21.11.18 23:56, Alex Amato wrote:
Hello, I have noticed 
that org.apache.beam.runners.flink.PortableTimersExecutionTest is very 
flakey, and repro'd this test timeout on the master branch in 40/50 runs.


I filed a JIRA issue: BEAM-6111 
<https://issues.apache.org/jira/browse/BEAM-6111>. I was just wondering 
if anyone knew why this may be occurring, and to check if anyone else 
has been experiencing this.


Thanks,
Alex


org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

2018-11-21 Thread Alex Amato
Hello, I have noticed that
org.apache.beam.runners.flink.PortableTimersExecutionTest
is very flakey, and repro'd this test timeout on the master branch in 40/50
runs.

I filed a JIRA issue: BEAM-6111
<https://issues.apache.org/jira/browse/BEAM-6111>. I was just wondering if
anyone knew why this may be occurring, and to check if anyone else has been
experiencing this.

Thanks,
Alex