Abacn opened a new issue, #29644:
URL: https://github.com/apache/beam/issues/29644

   ### What happened?
   
   This is not an SDK bug. Internal ticket will also be filed.
   
   It has been noticed that some tests fail at random on 
https://github.com/apache/beam/actions/workflows/beam_PostCommit_Java_ValidatesRunner_Dataflow_V2_Streaming.yml?query=event%3Aschedule
   
   For example,
   
   https://github.com/apache/beam/runs/19288524813 with
   ```
   java.lang.RuntimeException: Dataflow job 
2023-12-04_17_25_11-8098838800222529628 terminated in state RUNNING but did not 
return a failure reason.
        at 
org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:141)
        at 
org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:93)
        at 
org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:55)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:321)
   ```
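The message comes from the test runner's final-state check: a job that stops while still reporting RUNNING, with no failure reason attached, is treated as an error. A hedged plain-Java sketch of that kind of check (the class, enum, and method names here are illustrative, not the actual `TestDataflowRunner` source):

```java
public class TerminalStateCheck {
    enum State { RUNNING, DONE, FAILED, CANCELLED }

    // Hypothetical stand-in for the runner's final-state validation: returns
    // an error message when the job stopped without reaching a clean terminal
    // state and without reporting a failure reason, or null on success.
    static String checkFinalState(State finalState, String failureReason) {
        if (finalState == State.DONE) {
            return null; // clean success
        }
        if (failureReason != null) {
            return "Dataflow job failed: " + failureReason;
        }
        // This branch matches the exception seen in the stack trace above.
        return "Dataflow job terminated in state " + finalState
                + " but did not return a failure reason.";
    }

    public static void main(String[] args) {
        // A streaming job cancelled externally can still report RUNNING.
        System.out.println(checkFinalState(State.RUNNING, null));
    }
}
```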
   
   Taking a look at the worker log (`2023-12-04_01_06_49-4587814450665214461`):
   ```
   DEBUG 2023-12-04T09:10:15.680161433Z Cleaning up.
   DEBUG 2023-12-04T09:10:15.758333246Z Starting worker pool teardown.
   INFO 2023-12-04T09:10:15.782743429Z Stopping worker pool...
   DEBUG 2023-12-04T09:10:15.860076821Z Starting worker pool teardown.
   INFO 2023-12-04T09:10:15.884898782Z Stopping worker pool...
   INFO 2023-12-04T09:10:22.584818Z *** SIGTERM received by PID 14 (TID 14) on 
cpu 0 from PID 0; stack trace: ***
   INFO 2023-12-04T09:10:22.584901Z PC: @ 0x55e6a3671f96 (unknown) (unknown)
   ...
   INFO 2023-12-04T09:10:23.448826420Z ****************************************
   DEBUG 2023-12-04T09:15:59.183629917Z Cleaning up.
   DEBUG 2023-12-04T09:16:00.002974729Z Autoscaling: Reduced the number of 
workers to 0 based on low average worker CPU utilization, and the pipeline 
having sufficiently low backlog and keeping up with input rate.
   DEBUG 2023-12-04T09:16:05.045251507Z Starting worker pool teardown.
   INFO 2023-12-04T09:16:10.060122904Z Stopping worker pool...
   DEBUG 2023-12-04T09:16:10.114936278Z Starting worker pool teardown.
   INFO 2023-12-04T09:16:10.144000861Z Stopping worker pool...
   INFO 2023-12-04T09:16:10.179066170Z Worker pool stopped.
   DEBUG 2023-12-04T09:16:10.204084191Z Tearing down pending resources...
   INFO 2023-12-04T09:16:10.792660720Z Your project already contains 100 
Dataflow-created metric descriptors...
   ```
   
   Occasionally it takes about 6 minutes to stop the worker pool, during which 
TestPipeline.run() is still blocking.
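While the pool is being torn down, the runner keeps blocking on the job's terminal state. Beam's `PipelineResult` does offer `waitUntilFinish(Duration)` for a bounded wait; below is a plain-Java sketch of the same deadline-based polling idea, with no Beam dependency (the names here are illustrative, not Beam internals):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class DeadlineWait {
    enum State { RUNNING, DONE }

    // Poll a job-state supplier until it reports DONE or the deadline passes,
    // returning the last observed state either way.
    static State waitUntilFinish(Supplier<State> jobState, Duration timeout)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        State state = jobState.get();
        while (state != State.DONE && Instant.now().isBefore(deadline)) {
            Thread.sleep(10); // real code would poll the Dataflow API
            state = jobState.get();
        }
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        // A job that never finishes: the bounded wait gives up after the
        // timeout instead of blocking for minutes as in the flaky runs.
        State result = waitUntilFinish(() -> State.RUNNING, Duration.ofMillis(50));
        System.out.println(result);
    }
}
```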
   
   Another occurrence `2023-12-04_17_25_11-8098838800222529628`
   ```
   INFO 2023-12-05T01:27:58.530670Z Successfully sampled resources
   DEBUG 2023-12-05T01:28:21.709062648Z Cleaning up.
   DEBUG 2023-12-05T01:28:21.755866390Z Starting worker pool teardown.
   INFO 2023-12-05T01:28:21.778534382Z Stopping worker pool...
   DEBUG 2023-12-05T01:28:21.806963604Z Starting worker pool teardown.
   INFO 2023-12-05T01:28:21.827367725Z Stopping worker pool...
   ERROR 2023-12-05T01:28:28.218612Z *** SIGTERM received by PID 14 (TID 14) on 
cpu 0 from PID 0; stack trace: ***
   ERROR 2023-12-05T01:28:28.219090Z PC: @ 0x5621e4e71f96 (unknown) (unknown) @ 
0x5621e6f3de89 1344 (unknown) @ 0x7f9168c099a0 631864672 (unknown) @ 
0x5621e4e71f96 96 (unknown) @ 0x5621e4cafcca 208 (unknown) @ 0x5621e5df4d30 64 
(unknown) @ 0x5621e52547b7 1664 (unknown) @ 0x7f9168a77bbd 208 
__libc_start_main @ 0x5621e4c13599 (unknown) (unknown)
   WARNING 2023-12-05T01:28:28.223847Z --- CPU registers: --- r8=0 
r9=ffffffffffffffff r10=0 r11=246 r12=7f91689c9100 r13=7f91689c9180 r14=881 
r15=ffffffffffffffff rdi=7f91689c9180 rsi=189 rbp=7ffd8e451870 rbx=7f91689c9100 
rdx=0 rax=fffffffffffffffc rcx=5621e4e71f96 rsp=7ffd8e451820 rip=5621e4e71f96 
efl=246 cgf=2b000000000033 err=0 trp=0 msk=0 cr2=0
   ...
   INFO 2023-12-05T01:28:28.674663925Z *********************
   DEFAULT 2023-12-05T01:28:28.674675645Z 
{"logging.googleapis.com/diagnostic":{…}}
   DEBUG 2023-12-05T01:33:47.725042327Z Autoscaling: Reduced the number of 
workers to 0 based on low average worker CPU utilization, and the pipeline 
having sufficiently low backlog and keeping up with input rate.
   DEBUG 2023-12-05T01:33:47.946791112Z Cleaning up.
   DEBUG 2023-12-05T01:33:48.026216727Z Starting worker pool teardown.
   INFO 2023-12-05T01:33:48.042428706Z Stopping worker pool...
   DEBUG 2023-12-05T01:33:48.064501675Z Starting worker pool teardown.
   INFO 2023-12-05T01:33:48.086649213Z Stopping worker pool...
   INFO 2023-12-05T01:33:48.118769551Z Worker pool stopped.
   DEBUG 2023-12-05T01:33:48.138574980Z Tearing down pending resources...
   INFO 2023-12-05T01:33:49.164659964Z Your project already contains 100 
Dataflow-created metric descriptors...
   ```
   
   This run also took about 6 minutes.
   
   It is strange that the last log line is "Your project already contains 100 
Dataflow-created metric descriptors." — that message is usually seen during 
setup, not teardown.
   
   For comparison, a normally shut-down worker produces logs like this:
   ```
   DEBUG 2023-12-06T21:04:26.941513220Z Cleaning up.
   DEBUG 2023-12-06T21:04:26.980247165Z Starting worker pool teardown.
   INFO 2023-12-06T21:04:27.007520790Z Stopping worker pool...
   DEBUG 2023-12-06T21:04:27.036155756Z Starting worker pool teardown.
   INFO 2023-12-06T21:04:27.057789052Z Stopping worker pool...
   INFO 2023-12-06T21:04:34.289906Z *** SIGTERM received by PID 13 (TID 13) on 
cpu 1 from PID 0; stack trace: ***
   ERROR 2023-12-06T21:04:34.293400Z *** SIGTERM received by PID 13 (TID 13) on 
cpu 1 from PID 0; stack trace: ***
   DEFAULT 2023-12-06T21:04:34.718670731Z 
{"logging.googleapis.com/diagnostic":{…}}
   DEBUG 2023-12-06T21:06:48.144509841Z Autoscaling: Reduced the number of 
workers to 0 based on low average worker CPU utilization, and the pipeline 
having sufficiently low backlog and keeping up with input rate.
   INFO 2023-12-06T21:06:48.181400397Z Worker pool stopped.
   DEBUG 2023-12-06T21:06:48.208832034Z Tearing down pending resources...
   ```
   
   Likely there is a race condition in the worker pool shutdown that causes 
inconsistent teardown times.
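   The duplicated "Starting worker pool teardown." / "Stopping worker pool..." lines in every excerpt above suggest two components initiating teardown concurrently. A non-idempotent stop path would make that a race. This is purely an illustrative sketch of the suspected pattern and its usual fix (a compare-and-set guard), not Dataflow's actual shutdown code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class IdempotentTeardown {
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private int teardownCount = 0;

    // Only the first caller performs the actual teardown; later callers
    // return immediately instead of racing the in-flight shutdown.
    public boolean stopWorkerPool() {
        if (!stopped.compareAndSet(false, true)) {
            return false; // teardown already in progress or done
        }
        teardownCount++; // stand-in for the real pool shutdown work
        return true;
    }

    public int getTeardownCount() {
        return teardownCount;
    }

    public static void main(String[] args) {
        IdempotentTeardown pool = new IdempotentTeardown();
        System.out.println(pool.stopWorkerPool());    // first call tears down
        System.out.println(pool.stopWorkerPool());    // duplicate is ignored
        System.out.println(pool.getTeardownCount());  // teardown ran once
    }
}
```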
   
   ### Issue Failure
   
   Failure: Test is flaky
   
   ### Issue Priority
   
   Priority: 2 (backlog / disabled test but we think the product is healthy)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [X] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to