becketqin opened a new issue, #25485:
URL: https://github.com/apache/beam/issues/25485

   ### What happened?
   
   When the validation tests for the Flink runner run, `GroupByKeyTest$BasicTests$.testAfterProcessingTimeContinuationTriggerUsingState()` fails from time to time. I looked at the test but had some trouble understanding the expected behavior.
   
   As far as I understand, the expected (simplified) input and output sequence of the PTransforms is the following:
   1. Source
     a. emit `0`
     b. emit `Watermark.MAX_VALUE`
   2. WithKeys
     a. [in] `0`
     b. [out] `KV{"dummy key", 0}`
     c. [in] `Watermark.MAX_VALUE`
     d. [out] `Watermark.MAX_VALUE`
   3. ParDo
     a. [in] `KV{"dummy key", 0}`
     b. [out] `42`
     c. [in] `Watermark.MAX_VALUE`
     d. [out] `Watermark.MAX_VALUE` (which has a delay of 500 ms after it 
receives the `Watermark.MAX_VALUE`)
   4. First Sum
     a. [in] `42`
     b. [out] `{42, Timing=EARLY}` (emitted 1 ms after it receives the input of `42`)
     c. [in] `Watermark.MAX_VALUE` (roughly 500 ms after it emits `{42, Timing=EARLY}`)
     d. [out] `{42, Timing=ON_TIME}`
     e. [out] `Watermark.MAX_VALUE`
   5. Second Sum
     a. [in] `{42, Timing=EARLY}`
     b. [out] `{42, Timing=EARLY}` (emitted 1 ms after it receives the input of `{42, Timing=EARLY}`)
     c. [in] `{42, Timing=ON_TIME}`
     d. [out] `{84, Timing=EARLY}` (emitted 1 ms after it receives the input of `{42, Timing=ON_TIME}`, if the `Watermark.MAX_VALUE` hasn't arrived by then. Otherwise, this output will be skipped.)
     e. [in] `Watermark.MAX_VALUE`
     f. [out] `{84, Timing=ON_TIME}`
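   The timeline above can be checked with a toy, single-threaded simulation. This is an assumption-laden sketch in plain Python, not Beam code: it models a single accumulating pane whose early-fire timer is set 1 ms after the first element of each pane, per the delays listed above. The names `run_sum` and `WATERMARK_MAX` are hypothetical.

```python
WATERMARK_MAX = object()  # stand-in for Watermark.MAX_VALUE

def run_sum(events):
    """events: (time_ms, payload) pairs; returns (time_ms, pane) emissions.

    Models an accumulating Sum with a processing-time early-fire timer set
    1 ms after the first element of a pane (an assumption based on the
    sequence described above, not Beam's actual runner code).
    """
    out = []
    total = 0
    timer = None  # pending early-fire time, or None if no pane is open
    for t, payload in sorted(events, key=lambda e: e[0]):
        # Fire a pending early timer that comes due before this event.
        if timer is not None and timer <= t:
            out.append((timer, (total, "EARLY")))
            timer = None
        if payload is WATERMARK_MAX:
            # The final watermark closes the window with an ON_TIME pane.
            out.append((t, (total, "ON_TIME")))
        else:
            total += payload
            if timer is None:
                timer = t + 1  # first element of a new pane sets the timer
    return out

# Step 4: the first Sum sees 42 at t=0 and the final watermark ~500 ms later.
print(run_sum([(0, 42), (500, WATERMARK_MAX)]))
# -> [(1, (42, 'EARLY')), (500, (42, 'ON_TIME'))]
```

   Feeding the second Sum's inputs (the `EARLY` pane at t≈1 and the `ON_TIME` pane at t≈500) into the same model shows how the outcome flips depending on whether the final watermark beats the second early-fire timer.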
   
   I am wondering if the race condition between 5.d and 5.e is what makes the test flaky:
   - When 5.e happens before 5.d, the test passes because there is only one early fire, which is `42`.
   - When 5.e happens after 5.d, the test fails because both `42` and `84` show up as early fires.
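   To make the race concrete, here is a minimal sketch in plain Python (not Beam code; `second_sum_panes` is a hypothetical name) that enumerates the panes the second Sum would emit under each ordering of 5.d and 5.e, assuming accumulating panes as implied by the 42/84 values above:

```python
def second_sum_panes(watermark_before_early_timer):
    """Return the (value, timing) panes the second Sum emits."""
    panes = []
    total = 42                      # 5.a/5.b: first early input {42, EARLY}
    panes.append((total, "EARLY"))
    total += 42                     # 5.c: the first Sum's {42, ON_TIME} pane
    if not watermark_before_early_timer:
        # 5.d: the 1 ms processing-time timer fires before the watermark,
        # producing a second early pane.
        panes.append((total, "EARLY"))
    # 5.e/5.f: the final watermark closes the window with an ON_TIME pane.
    panes.append((total, "ON_TIME"))
    return panes

# Watermark wins the race: only one early fire -> test should pass.
print(second_sum_panes(True))   # [(42, 'EARLY'), (84, 'ON_TIME')]
# Timer wins the race: two early fires (42 and 84) -> test should fail.
print(second_sum_panes(False))  # [(42, 'EARLY'), (84, 'EARLY'), (84, 'ON_TIME')]
```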
   
   However, when I tweaked the delay of `AfterProcessingTime.pastFirstElementInPane()` a little, the test still failed in roughly 50% of the runs, so I am not sure the theory above holds.
   
   @kennknowles It looks like this test was added by you. Could you help take a look? Thanks.
   
   ### Issue Failure
   
   Failure: Test is flaky
   
   ### Issue Priority
   
   Priority: 2 (backlog / disabled test but we think the product is healthy)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [X] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

