Update: the root cause is still unknown. From my observation with debug logging and a thread dump, the "grpc-default-executor-XXX" threads disappear when the problematic tests hang. More notes: https://github.com/apache/beam/pull/14768#issuecomment-840228795
Interestingly the "grpc-default-executor-XXX" threads reappear in the logs when the pause triggers a 5-second timeout set by JUnit.

On Tue, May 11, 2021 at 1:12 PM Tomo Suzuki <[email protected]> wrote:

> Thank you for the advice. Yes, the latch not being counted-down is the
> problem. (my memo:
> https://github.com/apache/beam/pull/14474#discussion_r619557479 ) I'll
> need to figure out why withOnError is not called.
>
> > Can you repro locally?
>
> No, the task succeeds in my environment (./gradlew
> :runners:google-cloud-dataflow-java:worker:test).
>
>
> On Tue, May 11, 2021 at 12:34 PM Kenneth Knowles <[email protected]> wrote:
>
>> I am not sure how much you read the code of the test. So apologies if I
>> am saying things you already know. The test does something like:
>>
>> - start a logging service
>> - set up some stub clients, each with onError wired up to release a
>>   countdown latch
>> - send error responses to all three of them (actually it sends the error
>>   in the same task it creates the stub)
>> - each task waits on the latch
>>
>> So if onError does not deliver or does not call to release the countdown
>> latch, it will hang. I notice in the gist you provide that all three stub
>> clients are hung awaiting the latch. That is suspicious to me. I would want
>> to confirm if the flakiness always occurs in a way that hangs all three.
>> Then there are gRPC workers waiting on empty queues, and the main test
>> thread waiting for the hung tasks to complete.
>>
>> The problem could be something about the test set up. Personally I would
>> add a ton of logs, or potentially use a debugger, to confirm exactly the
>> state of things when it hangs. Can you repro locally? I think this same
>> functionality could be tested in different ways that might remove some of
>> the variables. For example starting up all the waiting tasks, then sending
>> all the onError messages that should cause them to terminate.
>>
>> Since this is a unit test, adding a timeout to just that method should
>> save time (but will make it harder to capture stack traces, etc). I've
>> opened up https://github.com/apache/beam/pull/14781 for that. There may
>> be a nice way to add a timeout to the executor to capture the hung stack,
>> but I didn't look for it.
>>
>> Kenn
>>
>> On Tue, May 11, 2021 at 7:36 AM Tomo Suzuki <[email protected]> wrote:
>>
>>> gRPC 1.37.0 showed the same problem:
>>> BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
>>> waits tasks forever, causing timeout in Java precommit.
>>>
>>> While I continue my investigation, I appreciate if someone knows the
>>> cause of the problem, I pasted the thread dump of the Java process when the
>>> test was frozen:
>>> https://github.com/apache/beam/pull/14768
>>>
>>> If this mystery is never solved, vendoring (a bit old) gRPC 1.32.2
>>> without the jboss dependencies is an alternate option, (suggestion by Kenn;
>>> memo
>>> <https://issues.apache.org/jira/browse/BEAM-11227?focusedCommentId=17318238&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17318238>
>>> )
>>>
>>> Regards,
>>> Tomo
>>>
>>>
>>> On Mon, May 10, 2021 at 9:40 AM Tomo Suzuki <[email protected]> wrote:
>>>
>>>> I was investigating the strange timeout (
>>>> https://github.com/apache/beam/pull/14474) but was occupied with
>>>> something else lately.
>>>> Let me try the new version today to see any improvements.
>>>>
>>>>
>>>> On Mon, May 10, 2021 at 4:57 AM Ismaël Mejía <[email protected]> wrote:
>>>>
>>>>> I just saw that gRPC 1.37.1 is out now (and with aarch64 support for
>>>>> python!) that made me wonder about this, what is the current status of
>>>>> upgrading the vendored dependency Tomo?
>>>>>
>>>>>
>>>>> On Thu, Apr 8, 2021 at 4:16 PM Tomo Suzuki <[email protected]> wrote:
>>>>>
>>>>>> We observed the cron job of Java Precommit for the master branch
>>>>>> started timing out often (not always) since upgrading the gRPC version.
>>>>>> https://github.com/apache/beam/pull/14466#issuecomment-815343974
>>>>>>
>>>>>> Exchanged messages with Kenn, I reverted to the change; now the
>>>>>> master branch uses the vendored gRPC 1.26.
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 31, 2021 at 11:40 AM Kenneth Knowles <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Merged. Let's keep an eye for trouble, and I will incorporate to the
>>>>>>> release branch.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Wed, Mar 31, 2021 at 6:45 AM Tomo Suzuki <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Regarding troubleshooting on build timeout, it seems that Docker
>>>>>>>> cache in Jenkins machines might be playing a role. As I run more "Java
>>>>>>>> Presubmit", I no longer observe timeouts in the PR.
>>>>>>>>
>>>>>>>> Kenn, would you merge the PR?
>>>>>>>> https://github.com/apache/beam/pull/14295 (all checks green,
>>>>>>>> including the new Java postcommit checks)
>>>>>>>>
>>>>>>>> On Thu, Mar 25, 2021 at 5:24 PM Kenneth Knowles <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, I agree this might be a good idea. This is not the only major
>>>>>>>>> issue on the release-2.29.0 branch.
>>>>>>>>>
>>>>>>>>> The counter argument is that we will be pulling in all the bugs
>>>>>>>>> introduced to `master` since the branch cut.
>>>>>>>>>
>>>>>>>>> As far as effort goes, I have been mostly focused on burning down
>>>>>>>>> the bugs so I would not lose much work in the release process.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Thu, Mar 25, 2021 at 1:42 PM Ismaël Mejía <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Precommit is quite unstable in the last days, so worth to check if
>>>>>>>>>> something is wrong in the CI.
>>>>>>>>>>
>>>>>>>>>> I have a question Kenn.
>>>>>>>>>> Given that cherry picking this might be a bit big as a change
>>>>>>>>>> can we just reconsider cutting the 2.29.0 branch again
>>>>>>>>>> after the updated gRPC version use gets merged and mark the issues
>>>>>>>>>> already fixed for version 2.30.0 to version 2.29.0 ? Seems like an
>>>>>>>>>> easier upgrade path (and we will get some nice fixes/improvements
>>>>>>>>>> like official Spark 3 support for free on the release).
>>>>>>>>>>
>>>>>>>>>> WDYT?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 24, 2021 at 8:06 PM Tomo Suzuki <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Update: I observe that Java precommit check is unstable in the PR to
>>>>>>>>>> > upgrade vendored gRPC (compared with an PR with an empty change).
>>>>>>>>>> > There's no constant failures; sometimes it succeeds and other times it
>>>>>>>>>> > faces timeout and flaky test failures.
>>>>>>>>>> >
>>>>>>>>>> > https://github.com/apache/beam/pull/14295#issuecomment-806071087
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > On Mon, Mar 22, 2021 at 10:46 AM Tomo Suzuki <[email protected]> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> Thank you for the voting and I see the artifact available in
>>>>>>>>>> >> Maven Central. I'll work on the PR to use the published artifact
>>>>>>>>>> >> today.
>>>>>>>>>> >>
>>>>>>>>>> >> https://search.maven.org/artifact/org.apache.beam/beam-vendor-grpc-1_36_0/0.1/jar
>>>>>>>>>> >>
>>>>>>>>>> >> On Tue, Mar 16, 2021 at 3:07 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>>>>> >>>
>>>>>>>>>> >>> Update on this: there are some minor issues and then I'll
>>>>>>>>>> >>> send out the RC.
>>>>>>>>>> >>>
>>>>>>>>>> >>> I think this is worth blocking 2.29.0 release on, so I will
>>>>>>>>>> >>> do this first. We are still eliminating other blockers from 2.29.0
>>>>>>>>>> >>> anyhow.
>>>>>>>>>> >>>
>>>>>>>>>> >>> Kenn
>>>>>>>>>> >>>
>>>>>>>>>> >>> On Mon, Mar 15, 2021 at 7:17 AM Tomo Suzuki <[email protected]> wrote:
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Hi Beam developers,
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> I'm working on upgrading the vendored gRPC 1.36.0
>>>>>>>>>> >>>> https://issues.apache.org/jira/browse/BEAM-11227 (PR:
>>>>>>>>>> >>>> https://github.com/apache/beam/pull/14028)
>>>>>>>>>> >>>> Let me know if you have any questions or concerns.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Background:
>>>>>>>>>> >>>> Exchanged messages with Ismaël in BEAM-11227, it seems that
>>>>>>>>>> >>>> it the ticket created by some automation is false positive, but it's
>>>>>>>>>> >>>> nice to use an artifact without being marked with CVE.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Kenn offered to work as the release manager (as in
>>>>>>>>>> >>>> https://s.apache.org/beam-release-vendored-artifacts) of the
>>>>>>>>>> >>>> vendored artifact.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> --
>>>>>>>>>> >>>> Regards,
>>>>>>>>>> >>>> Tomo
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >> --
>>>>>>>>>> >> Regards,
>>>>>>>>>> >> Tomo
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > --
>>>>>>>>>> > Regards,
>>>>>>>>>> > Tomo
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>> Tomo
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Tomo
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Tomo
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Tomo
>>>
>>
>
> --
> Regards,
> Tomo
>

--
Regards,
Tomo
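
For anyone trying to reproduce this outside Beam: Kenn's May 11 message describes a test in which each stub client's onError releases a countdown latch and each task waits on that latch. The following is only a rough sketch of that shape in plain Java, not the actual BeamFnLoggingServiceTest code and with no gRPC involved. FakeClient, deliverError and allClientsFinish are hypothetical names made up for illustration. It follows the alternative ordering Kenn suggests (start all waiting tasks first, then send all the errors) and uses a bounded await so that a missing onError shows up as a false result instead of a hang.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class LatchHangSketch {

  /** Hypothetical stand-in for a stub client; only the error callback is modeled. */
  static final class FakeClient {
    private final Consumer<Throwable> onError;

    FakeClient(Consumer<Throwable> onError) {
      this.onError = onError;
    }

    void deliverError(Throwable t) {
      onError.accept(t);
    }
  }

  /**
   * Returns true if every waiting task saw its latch released within the timeout.
   * When deliverErrors is false, no onError callback runs, which models the
   * situation seen in the thread dump: every task is stuck awaiting its latch.
   */
  public static boolean allClientsFinish(int clients, boolean deliverErrors, long timeoutMillis) {
    ExecutorService executor = Executors.newCachedThreadPool();
    try {
      List<FakeClient> stubs = new ArrayList<>();
      List<Future<Boolean>> tasks = new ArrayList<>();

      // Step 1: create every client and start every waiting task.
      for (int i = 0; i < clients; i++) {
        CountDownLatch latch = new CountDownLatch(1);
        stubs.add(new FakeClient(error -> latch.countDown()));
        // Bounded wait: await returns false on timeout instead of blocking forever.
        Callable<Boolean> waiter = () -> latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        tasks.add(executor.submit(waiter));
      }

      // Step 2: only now send the errors that should release the latches.
      if (deliverErrors) {
        for (FakeClient stub : stubs) {
          stub.deliverError(new RuntimeException("simulated failure"));
        }
      }

      boolean allFinished = true;
      for (Future<Boolean> task : tasks) {
        allFinished &= task.get();
      }
      return allFinished;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("errors delivered: " + allClientsFinish(3, true, 2000));
    System.out.println("errors dropped:   " + allClientsFinish(3, false, 200));
  }
}
```

With an unbounded latch.await() in place of the timed one, the second call would never return, which matches the symptom reported in the precommit job. The bounded wait does not explain why onError is not delivered; it only makes the failure visible quickly.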
