Update: I just merged Kiley's PR https://github.com/apache/beam/pull/14833. I triggered "Run Java Precommit" on it several times and did not observe the logging test (BeamFnLoggingServiceTest) failures. Let's see how the builds go.
Kenn, Ismaël, and Kiley,
Thank you for the help and follow-up!

On Thu, May 13, 2021 at 10:39 AM Tomo Suzuki <[email protected]> wrote:

> I'm giving up! Can anyone troubleshoot this gRPC concurrency problem
> further?
> My current view of the problem (link
> <https://github.com/apache/beam/pull/14768#issuecomment-840576342>) is
> that "grpc-default-executor" threads stop processing the data. But I cannot
> tell why.
>
> I also raised a question with grpc-java on how best to troubleshoot such a
> situation:
> https://github.com/grpc/grpc-java/issues/8174
>
> On Wed, May 12, 2021 at 11:29 PM Tomo Suzuki <[email protected]> wrote:
>
>> Update: the root cause is still unknown.
>>
>> From my observation with debug logging and thread dumps,
>> "grpc-default-executor-XXX" threads disappear when the problematic tests
>> hang.
>> More notes:
>> https://github.com/apache/beam/pull/14768#issuecomment-840228795
>>
>> Interestingly, the "grpc-default-executor-XXX" threads reappear in the
>> logs when the pause triggers a 5-second timeout set by JUnit.
>>
>> On Tue, May 11, 2021 at 1:12 PM Tomo Suzuki <[email protected]> wrote:
>>
>>> Thank you for the advice. Yes, the latch not being counted down is the
>>> problem. (my memo:
>>> https://github.com/apache/beam/pull/14474#discussion_r619557479 ) I'll
>>> need to figure out why withOnError is not called.
>>>
>>> > Can you repro locally?
>>>
>>> No, the task succeeds in my environment (./gradlew
>>> :runners:google-cloud-dataflow-java:worker:test).
>>>
>>> On Tue, May 11, 2021 at 12:34 PM Kenneth Knowles <[email protected]>
>>> wrote:
>>>
>>>> I am not sure how much of the test code you have read, so apologies if I
>>>> am saying things you already know.
>>>> The test does something like:
>>>>
>>>> - start a logging service
>>>> - set up some stub clients, each with onError wired up to release a
>>>>   countdown latch
>>>> - send error responses to all three of them (actually it sends the
>>>>   error in the same task that creates the stub)
>>>> - each task waits on the latch
>>>>
>>>> So if onError is not delivered, or does not release the countdown
>>>> latch, the test will hang. I notice in the gist you provided that all
>>>> three stub clients are hung awaiting the latch. That is suspicious to me. I
>>>> would want to confirm whether the flakiness always occurs in a way that
>>>> hangs all three. Then there are gRPC workers waiting on empty queues, and
>>>> the main test thread waiting for the hung tasks to complete.
>>>>
>>>> The problem could be something about the test setup. Personally, I
>>>> would add a ton of logs, or potentially use a debugger, to confirm exactly
>>>> the state of things when it hangs. Can you repro locally? I think this same
>>>> functionality could be tested in different ways that might remove some of
>>>> the variables. For example, starting up all the waiting tasks, then sending
>>>> all the onError messages that should cause them to terminate.
>>>>
>>>> Since this is a unit test, adding a timeout to just that method should
>>>> save time (but will make it harder to capture stack traces, etc). I've
>>>> opened up https://github.com/apache/beam/pull/14781 for that. There
>>>> may be a nice way to add a timeout to the executor to capture the hung
>>>> stack, but I didn't look for it.
>>>>
>>>> Kenn
>>>>
>>>> On Tue, May 11, 2021 at 7:36 AM Tomo Suzuki <[email protected]> wrote:
>>>>
>>>>> gRPC 1.37.0 showed the same problem:
>>>>> BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
>>>>> waits on tasks forever, causing a timeout in Java precommit.
>>>>>
>>>>> While I continue my investigation, I would appreciate it if someone
>>>>> knows the cause of the problem. I pasted the thread dump of the Java
>>>>> process taken when the test was frozen:
>>>>> https://github.com/apache/beam/pull/14768
>>>>>
>>>>> If this mystery is never solved, vendoring the (somewhat old) gRPC 1.32.2
>>>>> without the jboss dependencies is an alternative option (suggested by
>>>>> Kenn; memo
>>>>> <https://issues.apache.org/jira/browse/BEAM-11227?focusedCommentId=17318238&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17318238>
>>>>> )
>>>>>
>>>>> Regards,
>>>>> Tomo
>>>>>
>>>>> On Mon, May 10, 2021 at 9:40 AM Tomo Suzuki <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I was investigating the strange timeout
>>>>>> (https://github.com/apache/beam/pull/14474) but have been occupied with
>>>>>> something else lately.
>>>>>> Let me try the new version today to see whether there are any
>>>>>> improvements.
>>>>>>
>>>>>> On Mon, May 10, 2021 at 4:57 AM Ismaël Mejía <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I just saw that gRPC 1.37.1 is out now (and with aarch64 support for
>>>>>>> Python!), which made me wonder: what is the current status of
>>>>>>> upgrading the vendored dependency, Tomo?
>>>>>>>
>>>>>>> On Thu, Apr 8, 2021 at 4:16 PM Tomo Suzuki <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We observed that the Java Precommit cron job for the master branch
>>>>>>>> started timing out often (not always) after upgrading the gRPC version.
>>>>>>>> https://github.com/apache/beam/pull/14466#issuecomment-815343974
>>>>>>>>
>>>>>>>> After exchanging messages with Kenn, I reverted the change; the
>>>>>>>> master branch now uses the vendored gRPC 1.26.
>>>>>>>>
>>>>>>>> On Wed, Mar 31, 2021 at 11:40 AM Kenneth Knowles <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Merged. Let's keep an eye out for trouble, and I will incorporate it
>>>>>>>>> into the release branch.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Wed, Mar 31, 2021 at 6:45 AM Tomo Suzuki <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Regarding troubleshooting the build timeout, it seems that the Docker
>>>>>>>>>> cache on the Jenkins machines might be playing a role. As I run more
>>>>>>>>>> "Java Presubmit", I no longer observe timeouts in the PR.
>>>>>>>>>>
>>>>>>>>>> Kenn, would you merge the PR?
>>>>>>>>>> https://github.com/apache/beam/pull/14295 (all checks green,
>>>>>>>>>> including the new Java postcommit checks)
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 25, 2021 at 5:24 PM Kenneth Knowles <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, I agree this might be a good idea. This is not the only
>>>>>>>>>>> major issue on the release-2.29.0 branch.
>>>>>>>>>>>
>>>>>>>>>>> The counterargument is that we will be pulling in all the bugs
>>>>>>>>>>> introduced to `master` since the branch cut.
>>>>>>>>>>>
>>>>>>>>>>> As far as effort goes, I have been mostly focused on burning
>>>>>>>>>>> down the bugs, so I would not lose much work in the release process.
>>>>>>>>>>>
>>>>>>>>>>> Kenn
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 25, 2021 at 1:42 PM Ismaël Mejía <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Precommit has been quite unstable in the last few days, so it is
>>>>>>>>>>>> worth checking whether something is wrong in the CI.
>>>>>>>>>>>>
>>>>>>>>>>>> I have a question, Kenn. Given that cherry-picking this might be a
>>>>>>>>>>>> rather big change, can we reconsider cutting the 2.29.0 branch
>>>>>>>>>>>> again after the updated gRPC version gets merged, and re-mark the
>>>>>>>>>>>> issues already fixed for version 2.30.0 as fixed in version 2.29.0?
>>>>>>>>>>>> It seems like an easier upgrade path (and we would get some nice
>>>>>>>>>>>> fixes/improvements, like official Spark 3 support, for free in the
>>>>>>>>>>>> release).
>>>>>>>>>>>>
>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 24, 2021 at 8:06 PM Tomo Suzuki <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Update: I observe that the Java precommit check is unstable in
>>>>>>>>>>>> > the PR to upgrade vendored gRPC (compared with a PR with an empty
>>>>>>>>>>>> > change). There are no consistent failures; sometimes it succeeds
>>>>>>>>>>>> > and other times it hits timeouts and flaky test failures.
>>>>>>>>>>>> >
>>>>>>>>>>>> > https://github.com/apache/beam/pull/14295#issuecomment-806071087
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Mon, Mar 22, 2021 at 10:46 AM Tomo Suzuki
>>>>>>>>>>>> > <[email protected]> wrote:
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Thank you for voting; I see the artifact is available in
>>>>>>>>>>>> >> Maven Central. I'll work on the PR to use the published artifact
>>>>>>>>>>>> >> today.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> https://search.maven.org/artifact/org.apache.beam/beam-vendor-grpc-1_36_0/0.1/jar
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> On Tue, Mar 16, 2021 at 3:07 PM Kenneth Knowles
>>>>>>>>>>>> >> <[email protected]> wrote:
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Update on this: there are some minor issues and then I'll
>>>>>>>>>>>> >>> send out the RC.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> I think this is worth blocking the 2.29.0 release on, so I will
>>>>>>>>>>>> >>> do this first. We are still eliminating other blockers from
>>>>>>>>>>>> >>> 2.29.0 anyhow.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Kenn
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> On Mon, Mar 15, 2021 at 7:17 AM Tomo Suzuki
>>>>>>>>>>>> >>> <[email protected]> wrote:
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> Hi Beam developers,
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> I'm working on upgrading the vendored gRPC to 1.36.0:
>>>>>>>>>>>> >>>> https://issues.apache.org/jira/browse/BEAM-11227 (PR:
>>>>>>>>>>>> >>>> https://github.com/apache/beam/pull/14028)
>>>>>>>>>>>> >>>> Let me know if you have any questions or concerns.
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> Background:
>>>>>>>>>>>> >>>> After exchanging messages with Ismaël in BEAM-11227, it seems
>>>>>>>>>>>> >>>> that the ticket created by some automation is a false positive,
>>>>>>>>>>>> >>>> but it's nice to use an artifact that is not flagged with a CVE.
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> Kenn offered to work as the release manager (as in
>>>>>>>>>>>> >>>> https://s.apache.org/beam-release-vendored-artifacts) of the
>>>>>>>>>>>> >>>> vendored artifact.
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> --
>>>>>>>>>>>> >>>> Regards,
>>>>>>>>>>>> >>>> Tomo
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> --
>>>>>>>>>>>> >> Regards,
>>>>>>>>>>>> >> Tomo
>>>>>>>>>>>> >
>>>>>>>>>>>> > --
>>>>>>>>>>>> > Regards,
>>>>>>>>>>>> > Tomo
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Regards,
>>>>>>>>>> Tomo
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>> Tomo
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Tomo
>>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Tomo
>>>>>
>>>
>>> --
>>> Regards,
>>> Tomo
>>>
>>
>> --
>> Regards,
>> Tomo
>>
>
> --
> Regards,
> Tomo
>

--
Regards,
Tomo
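
The hang discussed in this thread follows the pattern Kenn describes: each task waits on a countdown latch that only an onError callback releases. A minimal Java sketch of that pattern, using only java.util.concurrent and no gRPC, might look like the following. The class name LatchTimeoutSketch, the runClients helper, and the deliverError flag are hypothetical and are not taken from BeamFnLoggingServiceTest; the sketch only illustrates how a bounded await turns a lost onError into a visible "false" result instead of an indefinite hang.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the actual Beam test code.
final class LatchTimeoutSketch {

  /**
   * Starts {@code clients} tasks. Each task creates its own latch and waits on it
   * for at most {@code timeoutMillis}. When {@code deliverError} is true, a stand-in
   * "onError" callback counts the latch down; when false, the callback is never
   * invoked, which models the hang scenario from the thread.
   * Returns how many tasks saw their latch released before the timeout.
   */
  static int runClients(int clients, boolean deliverError, long timeoutMillis)
      throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(clients);
    try {
      List<Future<Boolean>> results = new ArrayList<>();
      for (int i = 0; i < clients; i++) {
        results.add(executor.submit(() -> {
          CountDownLatch waitForTermination = new CountDownLatch(1);
          // Stand-in for the stub client's onError handler.
          Runnable onError = waitForTermination::countDown;
          if (deliverError) {
            // The error is "sent" in the same task that created the stub.
            onError.run();
          }
          // Bounded wait: returns false on timeout instead of blocking forever.
          return waitForTermination.await(timeoutMillis, TimeUnit.MILLISECONDS);
        }));
      }
      int released = 0;
      for (Future<Boolean> result : results) {
        if (result.get()) {
          released++;
        }
      }
      return released;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("released with onError delivered: " + runClients(3, true, 200));
    System.out.println("released with onError lost: " + runClients(3, false, 200));
  }
}
```

With onError delivered, all three tasks report their latch as released; with onError lost, none do, and the run still finishes after the timeout. An unbounded await() in the same place would block until the surrounding test or build timeout fires, which matches the symptom reported for the Java precommit job.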
