I am not sure how much of the test code you have read, so apologies if I am saying things you already know. The test does something like:
- start a logging service
- set up some stub clients, each with onError wired up to release a countdown latch
- send error responses to all three of them (actually it sends the error in the same task that creates the stub)
- each task waits on the latch

So if onError is not delivered, or does not release the countdown latch, the test will hang. I notice in the gist you provided that all three stub clients are hung awaiting the latch. That is suspicious to me. I would want to confirm whether the flakiness always occurs in a way that hangs all three. Beyond that, there are gRPC workers waiting on empty queues, and the main test thread waiting for the hung tasks to complete.

The problem could be something about the test setup. Personally I would add a ton of logs, or potentially use a debugger, to confirm exactly the state of things when it hangs. Can you repro locally?

I think this same functionality could be tested in different ways that might remove some of the variables. For example, starting up all the waiting tasks, then sending all the onError messages that should cause them to terminate.

Since this is a unit test, adding a timeout to just that method should save time (but will make it harder to capture stack traces, etc). I've opened up https://github.com/apache/beam/pull/14781 for that. There may be a nice way to add a timeout to the executor to capture the hung stack, but I didn't look for it.

Kenn

On Tue, May 11, 2021 at 7:36 AM Tomo Suzuki <[email protected]> wrote:

> gRPC 1.37.0 showed the same problem:
> BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
> waits tasks forever, causing timeout in Java precommit.
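
(Inline, to make the restructuring suggested above concrete: a minimal sketch, not the actual Beam test. FakeClient, allClientsReleased and dumpThreads are hypothetical names I made up for illustration, and there is no gRPC involved; the error is delivered by a direct method call. The idea is only to show the shape: start all the waiting tasks first, then send the errors, wait with a bound, and print every thread's stack if the bound is hit.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class LatchTimeoutSketch {

  /** Hypothetical stand-in for a stub client; not a Beam or gRPC class. */
  static final class FakeClient {
    private final Consumer<Throwable> onError;

    FakeClient(Consumer<Throwable> onError) {
      this.onError = onError;
    }

    void deliverError(Throwable t) {
      onError.accept(t);
    }
  }

  /**
   * Starts all waiting tasks first, then (optionally) delivers an error to each
   * client. Returns true if every waiter was released within the timeout.
   */
  static boolean allClientsReleased(int numClients, boolean deliverErrors, long timeoutSeconds) {
    ExecutorService executor = Executors.newFixedThreadPool(numClients);
    CountDownLatch allReleased = new CountDownLatch(numClients);
    List<FakeClient> clients = new ArrayList<>();
    try {
      for (int i = 0; i < numClients; i++) {
        CountDownLatch errored = new CountDownLatch(1);
        // onError is wired to release this client's latch.
        clients.add(new FakeClient(t -> errored.countDown()));
        // The waiting task is submitted before any error is sent.
        executor.submit(() -> {
          errored.await();
          allReleased.countDown();
          return null;
        });
      }
      if (deliverErrors) {
        for (FakeClient client : clients) {
          client.deliverError(new RuntimeException("simulated failure"));
        }
      }
      // Bounded wait instead of waiting forever.
      boolean released = allReleased.await(timeoutSeconds, TimeUnit.SECONDS);
      if (!released) {
        dumpThreads();
      }
      return released;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      // Interrupts any task still blocked on its latch.
      executor.shutdownNow();
    }
  }

  /** Prints the stack of every live thread, to see where things are stuck. */
  static void dumpThreads() {
    for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
      System.err.println(entry.getKey());
      for (StackTraceElement frame : entry.getValue()) {
        System.err.println("    at " + frame);
      }
    }
  }

  public static void main(String[] args) {
    System.out.println("errors delivered: " + allClientsReleased(3, true, 5));
    System.out.println("errors withheld:  " + allClientsReleased(3, false, 1));
  }
}
```

With the errors withheld, the bounded wait expires and the dump shows the three pool threads parked in CountDownLatch.await, which is the same picture as the hung stub clients in the thread dump. In the real test the interesting question would be whether onError is ever invoked at all.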
>
> While I continue my investigation, I appreciate if someone knows the cause
> of the problem, I pasted the thread dump of the Java process when the test
> was frozen:
> https://github.com/apache/beam/pull/14768
>
> If this mystery is never solved, vendoring (a bit old) gRPC 1.32.2 without
> the jboss dependencies is an alternate option, (suggestion by Kenn; memo
> <https://issues.apache.org/jira/browse/BEAM-11227?focusedCommentId=17318238&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17318238>
> )
>
> Regards,
> Tomo
>
>
> On Mon, May 10, 2021 at 9:40 AM Tomo Suzuki <[email protected]> wrote:
>
>> I was investigating the strange timeout (
>> https://github.com/apache/beam/pull/14474) but was occupied with
>> something else lately.
>> Let me try the new version today to see any improvements.
>>
>>
>> On Mon, May 10, 2021 at 4:57 AM Ismaël Mejía <[email protected]> wrote:
>>
>>> I just saw that gRPC 1.37.1 is out now (and with aarch64 support for
>>> python!) that made me wonder about this, what is the current status of
>>> upgrading the vendored dependency Tomo?
>>>
>>>
>>> On Thu, Apr 8, 2021 at 4:16 PM Tomo Suzuki <[email protected]> wrote:
>>>
>>>> We observed the cron job of Java Precommit for the master branch
>>>> started timing out often (not always) since upgrading the gRPC version.
>>>> https://github.com/apache/beam/pull/14466#issuecomment-815343974
>>>>
>>>> Exchanged messages with Kenn, I reverted to the change; now the master
>>>> branch uses the vendored gRPC 1.26.
>>>>
>>>>
>>>> On Wed, Mar 31, 2021 at 11:40 AM Kenneth Knowles <[email protected]> wrote:
>>>>
>>>>> Merged. Let's keep an eye for trouble, and I will incorporate to the
>>>>> release branch.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Wed, Mar 31, 2021 at 6:45 AM Tomo Suzuki <[email protected]> wrote:
>>>>>
>>>>>> Regarding troubleshooting on build timeout, it seems that Docker
>>>>>> cache in Jenkins machines might be playing a role. As I run more "Java
>>>>>> Presubmit", I no longer observe timeouts in the PR.
>>>>>>
>>>>>> Kenn, would you merge the PR?
>>>>>> https://github.com/apache/beam/pull/14295 (all checks green,
>>>>>> including the new Java postcommit checks)
>>>>>>
>>>>>> On Thu, Mar 25, 2021 at 5:24 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>
>>>>>>> Yes, I agree this might be a good idea. This is not the only major
>>>>>>> issue on the release-2.29.0 branch.
>>>>>>>
>>>>>>> The counter argument is that we will be pulling in all the bugs
>>>>>>> introduced to `master` since the branch cut.
>>>>>>>
>>>>>>> As far as effort goes, I have been mostly focused on burning down
>>>>>>> the bugs so I would not lose much work in the release process.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Thu, Mar 25, 2021 at 1:42 PM Ismaël Mejía <[email protected]> wrote:
>>>>>>>
>>>>>>>> Precommit is quite unstable in the last days, so worth to check if
>>>>>>>> something is wrong in the CI.
>>>>>>>>
>>>>>>>> I have a question Kenn. Given that cherry picking this might be a bit
>>>>>>>> big as a change can we just reconsider cutting the 2.29.0 branch again
>>>>>>>> after the updated gRPC version use gets merged and mark the issues
>>>>>>>> already fixed for version 2.30.0 to version 2.29.0 ? Seems like an
>>>>>>>> easier upgrade path (and we will get some nice fixes/improvements like
>>>>>>>> official Spark 3 support for free on the release).
>>>>>>>>
>>>>>>>> WDYT?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 24, 2021 at 8:06 PM Tomo Suzuki <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> > Update: I observe that Java precommit check is unstable in the PR to upgrade vendored gRPC (compared with an PR with an empty change). There's no constant failures; sometimes it succeeds and other times it faces timeout and flaky test failures.
>>>>>>>> >
>>>>>>>> > https://github.com/apache/beam/pull/14295#issuecomment-806071087
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Mon, Mar 22, 2021 at 10:46 AM Tomo Suzuki <[email protected]> wrote:
>>>>>>>> >>
>>>>>>>> >> Thank you for the voting and I see the artifact available in Maven Central. I'll work on the PR to use the published artifact today.
>>>>>>>> >> https://search.maven.org/artifact/org.apache.beam/beam-vendor-grpc-1_36_0/0.1/jar
>>>>>>>> >>
>>>>>>>> >> On Tue, Mar 16, 2021 at 3:07 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>>> >>>
>>>>>>>> >>> Update on this: there are some minor issues and then I'll send out the RC.
>>>>>>>> >>>
>>>>>>>> >>> I think this is worth blocking 2.29.0 release on, so I will do this first. We are still eliminating other blockers from 2.29.0 anyhow.
>>>>>>>> >>>
>>>>>>>> >>> Kenn
>>>>>>>> >>>
>>>>>>>> >>> On Mon, Mar 15, 2021 at 7:17 AM Tomo Suzuki <[email protected]> wrote:
>>>>>>>> >>>>
>>>>>>>> >>>> Hi Beam developers,
>>>>>>>> >>>>
>>>>>>>> >>>> I'm working on upgrading the vendored gRPC 1.36.0
>>>>>>>> >>>> https://issues.apache.org/jira/browse/BEAM-11227 (PR: https://github.com/apache/beam/pull/14028)
>>>>>>>> >>>> Let me know if you have any questions or concerns.
>>>>>>>> >>>>
>>>>>>>> >>>> Background:
>>>>>>>> >>>> Exchanged messages with Ismaël in BEAM-11227, it seems that it the ticket created by some automation is false positive, but it's nice to use an artifact without being marked with CVE.
>>>>>>>> >>>>
>>>>>>>> >>>> Kenn offered to work as the release manager (as in https://s.apache.org/beam-release-vendored-artifacts) of the vendored artifact.
>>>>>>>> >>>>
>>>>>>>> >>>> --
>>>>>>>> >>>> Regards,
>>>>>>>> >>>> Tomo
>>>>>>>> >>
>>>>>>>> >> --
>>>>>>>> >> Regards,
>>>>>>>> >> Tomo
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Regards,
>>>>>>>> > Tomo
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Tomo
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Tomo
>>>>
>>>
>>
>> --
>> Regards,
>> Tomo
>>
>
> --
> Regards,
> Tomo
>
