I'm giving up! Can anyone troubleshoot this gRPC concurrency problem further? My current view of the problem (link <https://github.com/apache/beam/pull/14768#issuecomment-840576342>) is that "grpc-default-executor" threads stop processing the data. But I cannot tell why.
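One way to check that hypothesis is a small watchdog that periodically prints the state of every "grpc-default-executor" thread while the test hangs. The sketch below is illustrative only (plain JDK APIs, not Beam or test code) and assumes grpc-java's default naming of its shared executor threads:

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Minimal watchdog sketch: every few seconds, print the state and stack of
 * every thread whose name starts with "grpc-default-executor", so we can see
 * whether those threads are gone, parked, or blocked while the test hangs.
 */
public class GrpcExecutorWatchdog {
  public static ScheduledExecutorService start() {
    ScheduledExecutorService watchdog =
        Executors.newSingleThreadScheduledExecutor(
            runnable -> {
              Thread t = new Thread(runnable, "grpc-executor-watchdog");
              t.setDaemon(true); // do not keep the JVM alive once the test finishes
              return t;
            });
    watchdog.scheduleAtFixedRate(
        () -> {
          for (Map.Entry<Thread, StackTraceElement[]> entry :
              Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (thread.getName().startsWith("grpc-default-executor")) {
              System.err.println(thread.getName() + " state=" + thread.getState());
              for (StackTraceElement frame : entry.getValue()) {
                System.err.println("    at " + frame);
              }
            }
          }
        },
        5, 5, TimeUnit.SECONDS);
    return watchdog;
  }
}
```

If the loop prints nothing, the executor threads really have exited; if they show up as WAITING on an empty queue, they are alive but not receiving tasks, which would point at the dispatch side instead.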
I also raised a question to grpc-java on how best to troubleshoot such a situation: https://github.com/grpc/grpc-java/issues/8174

On Wed, May 12, 2021 at 11:29 PM Tomo Suzuki <[email protected]> wrote:

> Update: the root cause is still unknown.
>
> From my observation with debug logging and thread dumps,
> "grpc-default-executor-XXX" threads disappear when the problematic tests hang.
> More notes:
> https://github.com/apache/beam/pull/14768#issuecomment-840228795
>
> Interestingly, the "grpc-default-executor-XXX" threads reappear in the logs
> when the pause triggers a 5-second timeout set by JUnit.
>
>
> On Tue, May 11, 2021 at 1:12 PM Tomo Suzuki <[email protected]> wrote:
>
>> Thank you for the advice. Yes, the latch not being counted down is the
>> problem. (my memo:
>> https://github.com/apache/beam/pull/14474#discussion_r619557479 ) I'll
>> need to figure out why withOnError is not called.
>>
>> > Can you repro locally?
>>
>> No, the task succeeds in my environment (./gradlew
>> :runners:google-cloud-dataflow-java:worker:test).
>>
>>
>> On Tue, May 11, 2021 at 12:34 PM Kenneth Knowles <[email protected]> wrote:
>>
>>> I am not sure how much you have read the code of the test, so apologies if I
>>> am saying things you already know. The test does something like:
>>>
>>> - start a logging service
>>> - set up some stub clients, each with onError wired up to release a
>>>   countdown latch
>>> - send error responses to all three of them (actually it sends the
>>>   error in the same task that creates the stub)
>>> - each task waits on the latch
>>>
>>> So if onError is not delivered, or does not call through to release the
>>> countdown latch, it will hang. I notice in the gist you provided that all
>>> three stub clients are hung awaiting the latch. That is suspicious to me. I
>>> would want to confirm whether the flakiness always occurs in a way that
>>> hangs all three. Then there are gRPC workers waiting on empty queues, and
>>> the main test thread waiting for the hung tasks to complete.
>>>
>>> The problem could be something about the test setup. Personally I would
>>> add a ton of logs, or potentially use a debugger, to confirm exactly the
>>> state of things when it hangs. Can you repro locally? I think this same
>>> functionality could be tested in different ways that might remove some of
>>> the variables, for example starting up all the waiting tasks and then
>>> sending all the onError messages that should cause them to terminate.
>>>
>>> Since this is a unit test, adding a timeout to just that method should
>>> save time (but will make it harder to capture stack traces, etc.). I've
>>> opened up https://github.com/apache/beam/pull/14781 for that. There may
>>> be a nicer way to add a timeout to the executor to capture the hung stack,
>>> but I didn't look for it.
>>>
>>> Kenn
>>>
>>> On Tue, May 11, 2021 at 7:36 AM Tomo Suzuki <[email protected]> wrote:
>>>
>>>> gRPC 1.37.0 showed the same problem:
>>>> BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
>>>> waits for its tasks forever, causing a timeout in the Java precommit.
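As an aside, here is a minimal, hypothetical sketch of the pattern Kenn describes above: a StreamObserver whose onError releases a CountDownLatch, awaited with a bounded timeout so a missed callback fails fast instead of hanging the precommit. The class, method, and message type names are made up for illustration and do not match the real BeamFnLoggingServiceTest.

```java
import static org.junit.Assert.assertTrue;

import io.grpc.Status;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.junit.Test;

/** Illustrative sketch only; names do not match the real BeamFnLoggingServiceTest. */
public class ErrorDeliveryExampleTest {

  /** Client-side observer that releases the latch when the server reports an error. */
  static class ErrorLatchObserver implements StreamObserver<String> {
    private final CountDownLatch onErrorLatch;

    ErrorLatchObserver(CountDownLatch onErrorLatch) {
      this.onErrorLatch = onErrorLatch;
    }

    @Override
    public void onNext(String value) {
      // Not relevant to this example.
    }

    @Override
    public void onError(Throwable t) {
      // If this callback is never invoked, the latch keeps its count and the
      // bounded await in the test below fails instead of hanging forever.
      onErrorLatch.countDown();
    }

    @Override
    public void onCompleted() {}
  }

  @Test(timeout = 30_000) // method-level timeout, in the spirit of the fix discussed above
  public void errorsAreDeliveredToAllClients() throws Exception {
    int clientCount = 3;
    CountDownLatch onErrorLatch = new CountDownLatch(clientCount);

    for (int i = 0; i < clientCount; i++) {
      ErrorLatchObserver observer = new ErrorLatchObserver(onErrorLatch);
      // In the real test a gRPC stub is created here and the service sends an
      // error response; for this sketch we simulate that delivery directly.
      observer.onError(Status.INTERNAL.asRuntimeException());
    }

    // Bounded wait: a missing onError shows up as a clear assertion failure.
    assertTrue(
        "onError was not delivered to every client",
        onErrorLatch.await(5, TimeUnit.SECONDS));
  }
}
```

The bounded await plus the method-level timeout is the same idea as the change Kenn mentions; the exact values used in the real PR may differ.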
>>>> While I continue my investigation, I would appreciate it if someone knows the
>>>> cause of the problem. I pasted the thread dump of the Java process, taken while
>>>> the test was frozen:
>>>> https://github.com/apache/beam/pull/14768
>>>>
>>>> If this mystery is never solved, vendoring the (somewhat old) gRPC 1.32.2
>>>> without the jboss dependencies is an alternative option (suggestion by Kenn;
>>>> memo
>>>> <https://issues.apache.org/jira/browse/BEAM-11227?focusedCommentId=17318238&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17318238>
>>>> )
>>>>
>>>> Regards,
>>>> Tomo
>>>>
>>>>
>>>> On Mon, May 10, 2021 at 9:40 AM Tomo Suzuki <[email protected]> wrote:
>>>>
>>>>> I was investigating the strange timeout (
>>>>> https://github.com/apache/beam/pull/14474) but have been occupied with
>>>>> something else lately.
>>>>> Let me try the new version today to see whether there are any improvements.
>>>>>
>>>>>
>>>>> On Mon, May 10, 2021 at 4:57 AM Ismaël Mejía <[email protected]> wrote:
>>>>>
>>>>>> I just saw that gRPC 1.37.1 is out now (and with aarch64 support for
>>>>>> Python!), which made me wonder about this: what is the current status of
>>>>>> upgrading the vendored dependency, Tomo?
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 8, 2021 at 4:16 PM Tomo Suzuki <[email protected]> wrote:
>>>>>>
>>>>>>> We observed that the cron job of the Java precommit for the master branch
>>>>>>> started timing out often (not always) after upgrading the gRPC version.
>>>>>>> https://github.com/apache/beam/pull/14466#issuecomment-815343974
>>>>>>>
>>>>>>> After exchanging messages with Kenn, I reverted the change; the
>>>>>>> master branch now uses the vendored gRPC 1.26.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 31, 2021 at 11:40 AM Kenneth Knowles <[email protected]> wrote:
>>>>>>>
>>>>>>>> Merged. Let's keep an eye out for trouble, and I will incorporate it into
>>>>>>>> the release branch.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Wed, Mar 31, 2021 at 6:45 AM Tomo Suzuki <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Regarding troubleshooting the build timeout, it seems that the Docker
>>>>>>>>> cache on the Jenkins machines might be playing a role. As I run more
>>>>>>>>> "Java Presubmit" jobs, I no longer observe timeouts in the PR.
>>>>>>>>>
>>>>>>>>> Kenn, would you merge the PR?
>>>>>>>>> https://github.com/apache/beam/pull/14295 (all checks green,
>>>>>>>>> including the new Java postcommit checks)
>>>>>>>>>
>>>>>>>>> On Thu, Mar 25, 2021 at 5:24 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I agree this might be a good idea. This is not the only
>>>>>>>>>> major issue on the release-2.29.0 branch.
>>>>>>>>>>
>>>>>>>>>> The counter-argument is that we will be pulling in all the bugs
>>>>>>>>>> introduced to `master` since the branch cut.
>>>>>>>>>>
>>>>>>>>>> As far as effort goes, I have been mostly focused on burning down
>>>>>>>>>> the bugs, so I would not lose much work in the release process.
>>>>>>>>>>
>>>>>>>>>> Kenn
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 25, 2021 at 1:42 PM Ismaël Mejía <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Precommit has been quite unstable in the last few days, so it is worth
>>>>>>>>>>> checking whether something is wrong in the CI.
>>>>>>>>>>>
>>>>>>>>>>> I have a question, Kenn.
>>>>>>>>>>> Given that cherry-picking this might be a bit big as a change,
>>>>>>>>>>> can we just reconsider cutting the 2.29.0 branch again after the
>>>>>>>>>>> updated gRPC version gets merged, and re-mark the issues already
>>>>>>>>>>> fixed for version 2.30.0 as fixed in version 2.29.0? That seems like
>>>>>>>>>>> an easier upgrade path (and we will get some nice fixes/improvements,
>>>>>>>>>>> like official Spark 3 support, for free on the release).
>>>>>>>>>>>
>>>>>>>>>>> WDYT?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 24, 2021 at 8:06 PM Tomo Suzuki <[email protected]> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Update: I observe that the Java precommit check is unstable in the
>>>>>>>>>>> PR to upgrade the vendored gRPC (compared with a PR with an empty change).
>>>>>>>>>>> The failures are not constant; sometimes it succeeds and other times it
>>>>>>>>>>> hits timeouts and flaky test failures.
>>>>>>>>>>> >
>>>>>>>>>>> > https://github.com/apache/beam/pull/14295#issuecomment-806071087
>>>>>>>>>>> >
>>>>>>>>>>> > On Mon, Mar 22, 2021 at 10:46 AM Tomo Suzuki <[email protected]> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> Thank you for the voting; I see the artifact is available in
>>>>>>>>>>> Maven Central. I'll work on the PR to use the published artifact today.
>>>>>>>>>>> >> https://search.maven.org/artifact/org.apache.beam/beam-vendor-grpc-1_36_0/0.1/jar
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Tue, Mar 16, 2021 at 3:07 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Update on this: there are some minor issues, and then I'll
>>>>>>>>>>> send out the RC.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I think this is worth blocking the 2.29.0 release on, so I will
>>>>>>>>>>> do this first. We are still eliminating other blockers from 2.29.0 anyhow.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Kenn
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> On Mon, Mar 15, 2021 at 7:17 AM Tomo Suzuki <[email protected]> wrote:
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> Hi Beam developers,
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> I'm working on upgrading the vendored gRPC to 1.36.0:
>>>>>>>>>>> >>>> https://issues.apache.org/jira/browse/BEAM-11227 (PR:
>>>>>>>>>>> https://github.com/apache/beam/pull/14028)
>>>>>>>>>>> >>>> Let me know if you have any questions or concerns.
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> Background:
>>>>>>>>>>> >>>> Having exchanged messages with Ismaël in BEAM-11227, it seems that
>>>>>>>>>>> the ticket created by some automation is a false positive, but it's nice
>>>>>>>>>>> to use an artifact that is not marked with a CVE.
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> Kenn offered to work as the release manager (as in
>>>>>>>>>>> https://s.apache.org/beam-release-vendored-artifacts) of the
>>>>>>>>>>> vendored artifact.
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> --
>>>>>>>>>>> >>>> Regards,
>>>>>>>>>>> >>>> Tomo

--
Regards,
Tomo
