[ 
https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Wozniakowski updated FLINK-30562:
----------------------------------------
    Description: 
(Apologies for the speculative and somewhat vague ticket, but I wanted to raise 
this while I am investigating to see if anyone has suggestions to help me 
narrow down the problem.)

We are encountering an issue where our streaming Flink job has stopped working 
correctly since Flink 1.15.3. This problem is also present on Flink 1.16.0. The 
Keyed CEP operators that our job uses are no longer emitting Patterns reliably, 
but critically *this is only happening when parallelism is set to a value 
greater than 1*. 

Our local build tests were previously set up using in-JVM `MiniCluster` 
instances, or dockerised Flink clusters all set with a parallelism of 1, so 
this problem was not caught and it caused an outage when we upgraded the 
cluster version in production.

Observing the job using the Flink console in production, I can see that events 
are *arriving* into the Keyed CEP operators, but no Pattern events are being 
emitted out of any of the operators. Furthermore, all the reported Watermark 
values are zero, though I don't know if that is a red herring as it seems 
Watermark reporting seems to have changed since 1.14.x.

I am currently attempting to create a stripped down version of our streaming 
job to demonstrate the problem, but this is quite tricky to set up. In the 
meantime I would appreciate any hints that could point me in the right 
direction.

I have isolated the problem to the Keyed CEP operator by removing our real 
sinks and sources from the failing test. I am still seeing the erroneous 
behaviour when setting up a job as:

# Events are read from a list using `env.fromCollection( ... )`
# CEP operator processes events
# Output is captured in another list for assertions

My best guess at the moment is something to do with Watermark emission? There 
seems to have been changes related to watermark alignment, perhaps this has 
caused some kind of regression in the CEP library? To reiterate, *this problem 
only occurs with parallelism of 2 or more. Setting the parallelism to 1 
immediately fixes the issue*

  was:
(Apologies for the speculative and somewhat vague ticket, but I wanted to raise 
this while I am investigating to see if anyone has suggestions to help me 
narrow down the problem.)

We are encountering an issue where our streaming Flink job has stopped working 
correctly since Flink 1.15.3. This problem is also present on Flink 1.16.0. The 
Keyed CEP operators that our job uses are no longer emitting Patterns reliably, 
but critically *this is only happening when parallelism is set to a value 
greater than 1*. 

Our local build tests were previously set up using in-JVM `MiniCluster` 
instances, or dockerised Flink clusters all set with a parallelism of 1, so 
this problem was not caught and it caused an outage when we upgraded the 
cluster version in production.

Observing the job using the Flink console in production, I can see that events 
are *arriving* into the Keyed CEP operators, but no Pattern events are being 
emitted out of any of the operators. Furthermore, all the reported Watermark 
values are zero, though I don't know if that is a red herring as it seems 
Watermark reporting seems to have changed since 1.14.x.

I am currently attempting to create a stripped down version of our streaming 
job to demonstrate the problem, but this is quite tricky to set up. In the 
meantime I would appreciate any hints that could point me in the right 
direction.

I have isolated the problem to the Keyed CEP operator by removing our real 
sinks and sources from the failing test. I am still seeing the erroneous 
behaviour when setting up a job as:

# Events are read from a list using `env.fromCollection( ... )`
# CEP operator processes events
# Output is captured in another list for assertions

My best guess at the moment is something to do with Watermark emission? There 
seems to have been changes related to watermark alignment, perhaps this has 
caused some kind of regression in the CEP library?


> Patterns are not emitted with parallelism >1 since 1.15.x+
> ----------------------------------------------------------
>
>                 Key: FLINK-30562
>                 URL: https://issues.apache.org/jira/browse/FLINK-30562
>             Project: Flink
>          Issue Type: Bug
>          Components: Library / CEP
>    Affects Versions: 1.16.0, 1.15.3
>         Environment: Problem observed in:
> Production:
> Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and 
> sink to AWS SQS
> Local:
> Completely local MiniCluster based test with no external sinks or sources
>            Reporter: Thomas Wozniakowski
>            Priority: Major
>
> (Apologies for the speculative and somewhat vague ticket, but I wanted to 
> raise this while I am investigating to see if anyone has suggestions to help 
> me narrow down the problem.)
> We are encountering an issue where our streaming Flink job has stopped 
> working correctly since Flink 1.15.3. This problem is also present on Flink 
> 1.16.0. The Keyed CEP operators that our job uses are no longer emitting 
> Patterns reliably, but critically *this is only happening when parallelism is 
> set to a value greater than 1*. 
> Our local build tests were previously set up using in-JVM `MiniCluster` 
> instances, or dockerised Flink clusters all set with a parallelism of 1, so 
> this problem was not caught and it caused an outage when we upgraded the 
> cluster version in production.
> Observing the job using the Flink console in production, I can see that 
> events are *arriving* into the Keyed CEP operators, but no Pattern events are 
> being emitted out of any of the operators. Furthermore, all the reported 
> Watermark values are zero, though I don't know if that is a red herring as it 
> seems Watermark reporting seems to have changed since 1.14.x.
> I am currently attempting to create a stripped down version of our streaming 
> job to demonstrate the problem, but this is quite tricky to set up. In the 
> meantime I would appreciate any hints that could point me in the right 
> direction.
> I have isolated the problem to the Keyed CEP operator by removing our real 
> sinks and sources from the failing test. I am still seeing the erroneous 
> behaviour when setting up a job as:
> # Events are read from a list using `env.fromCollection( ... )`
> # CEP operator processes events
> # Output is captured in another list for assertions
> My best guess at the moment is something to do with Watermark emission? There 
> seems to have been changes related to watermark alignment, perhaps this has 
> caused some kind of regression in the CEP library? To reiterate, *this 
> problem only occurs with parallelism of 2 or more. Setting the parallelism to 
> 1 immediately fixes the issue*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to