Re: [DISCUSS] FLIP-200: Support Multiple Rule and Dynamic Rule Changing (Flink CEP)

David Morávek Thu, 30 Dec 2021 04:43:06 -0800

Hi all,

sorry for the late reply, vacation season ;) I'm still not 100% sold on
choosing the OC for this use-case, but on the other hand I don't have
strong arguments against it. Few more questions / thoughts:


We're still talking about the "web server based"
pattern_processor_discoverer, but what about other use cases? One of my big
concerns is that user's can not really reuse any part of the Flink
ecosystem to implement the discovery logic. For example if they want to
read patterns from Kafka topic, they need to roll their own discoverer
based on the vanilla Kafka client. If we're talking about extensibility,
should we also make sure that the existing primitives can be reused?

For instance, the example I gave in my previous email seems not easily
> achievable with side-input / broadcast streams: a single invalid pattern
> detected on a TM can be disabled elegantly globally without crashing the
> entire Flink job.


This can be done for the side-input as well by filtering invalid patterns
before the broadcast. You can also send the invalid patterns to any side
output you want. I have a feeling that we're way too attached to the REST
server use case in this discussion. I agree that for that case, this
solution is the most straightforward one.

OC is a 2-way communication mechanism, i.e. a subtask can also send
> OperatorEvent to the coordinator to report its owns status, so that the
> coordinator can act accordingly.
>

I agree that 2-way communication in the "data-flow like" API is tricky,
because it requires cycles / iterations, which are still not really solved
(for a good reason, it's really tough nut to crack). This makes me think
that the OC may be bit of a "incomplete" workaround for not having fully
working support for iterations.

For example I'm not really confident that the checkpointing of the OC works
correctly right now, because it doesn't seem to require checkpoint barrier
alignment as the regular stream inputs. We also don't have a proper support
for watermarking (this is again tricky, because of the cycle).

If we decide to go down this road, should we first address some of these
limitations?

OperatorCoordinator will checkpoint the full amount of PatternProcessor
> data. For the reprocessing of historical data, you can read the
> PatternProcessor snapshots saved by this checkpoint from a certain
> historical checkpoint, and then recreate the historical data through these
> PatternProcessor snapshots.
>

If I understand that correctly, this means only the LATEST state of the
patterns (in other words - patterns that are currently in use). Is this
really sufficient for historical re-processing? Can someone for example
want re-process the data in more of a "temporal join" fashion? Also AFAIK
historical processing in combination with "coordinator checkpoints" is not
really something that we currently support of the box, are there any plans
on tackling this (my other concern is that this should not go against the
"unified batch & stream processing" efforts)?

I do agree that having the user defined control logic defined in the JM
> increases the chance of instability.
>

I can imagine that if this should be a concern, we could move the execution
of the OC to the task managers. This also makes me thing, that we shouldn't
make any strong assumptions that the OC will always run in the JobManager
(this is especially relevant for the embedded web-server use case).

If an agreement is reached on OperatorCoodinator, I will start the voting
> thread.
>

As for the vote, I'd would be great if we can wait until the next week as
many people took vacation until end of the year.

Overall, I really like the feature, this will be a great addition to Flink.

Best,
D.



On Thu, Dec 30, 2021 at 11:27 AM Martijn Visser <[email protected]>
wrote:

> Hi all,
>
> I can understand the need for a control plane mechanism. I'm not the
> technical go-to person for questions on the OperatorCoordinator, but I
> would expect that we could offer those interfaces from Flink but shouldn't
> recommend running user-code in the JobManager itself. I think the user code
> (like a webserver) should run outside of Flink (like via a sidecar) and use
> only the provided interfaces to communicate.
>
> I would like to get @David Morávek <[email protected]> opinion on the
> technical part.
>
> Best regards,
>
> Martijn
>
> On Thu, 30 Dec 2021 at 10:07, Nicholas Jiang <[email protected]>
> wrote:
>
>> Hi Konstantin, Becket, Martijn,
>>
>> Thanks for sharing your feedback. What other concerns do you have about
>> OperatorCoodinator? If an agreement is reached on OperatorCoodinator, I
>> will start the voting thread.
>>
>> Best,
>> Nicholas Jiang
>>
>> On 2021/12/22 03:19:58 Becket Qin wrote:
>> > Hi Konstantin,
>> >
>> > Thanks for sharing your thoughts. Please see the reply inline below.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> > On Tue, Dec 21, 2021 at 7:14 PM Konstantin Knauf <[email protected]>
>> wrote:
>> >
>> > > Hi Becket, Hi Nicholas,
>> > >
>> > > Thanks for joining the discussion.
>> > >
>> > > 1 ) Personally, I would argue that we should only run user code in the
>> > > Jobmanager/Jobmaster if we can not avoid it. It seems wrong to me to
>> > > encourage users to e.g. run a webserver on the Jobmanager, or
>> continuously
>> > > read patterns from a Kafka Topic on the Jobmanager, but both of these
>> I see
>> > > happening with the current design. We've had lots of issues with
>> > > classloading leaks and other stability issues on the Jobmanager and
>> making
>> > > this more complicated, if there is another way, seems unnecessary.
>> >
>> >
>> > I think the key question here is what primitive does Flink provide to
>> > facilitate the user implementation of their own control logic / control
>> > plane? It looks that previously, Flink assumes that all the user logic
>> is
>> > just data processing logic without any control / coordination
>> requirements.
>> > However, it turns out that a decent control plane abstraction is
>> required
>> > in association with the data processing logic in many cases, including
>> > Source / Sink and other user defined operators in general. The fact
>> that we
>> > ended up with adding the SplitEnumerator and GlobalCommitter are just
>> two
>> > examples of the demand of such coordination among user defined logics.
>> > There are other cases that we see in ecosystem projects, such as
>> > deep-learning-on-flink[1]. Now we see this again in CEP.
>> >
>> > Such control plane primitives are critical to the extensibility of a
>> > project. If we look at other projects, exposing such control plane
>> logic is
>> > quite common. For example, Hadoop ended up with exposing YARN as a
>> public
>> > API to the users, which is extremely popular. Kafka consumers exposed
>> the
>> > consumer group rebalance logic to the users via
>> ConsumerPartitionAssigner,
>> > which is also a control plane primitive.
>> >
>> > To me it is more important to think about how we can improve the
>> stability
>> > of such a control plane mechanism, instead of simply saying no to the
>> users.
>> >
>> >
>> >
>> >
>> > > 2) In addition, I suspect that, over time we will have to implement
>> all the
>> > > functionality that regular sources already provide around consistency
>> > > (watermarks, checkpointing) for the PatternProcessorCoordinator, too.
>> >
>> >
>> > I think OperatorCoordinator should have a generic communication
>> mechanism
>> > for all the operators, not specific to Source. We should probably have
>> an
>> > AbstractOperatorCoordinator help dealing with the communication layer,
>> and
>> > leave the state maintenance and event handling logic to the user code.
>> >
>> >
>> > > 3) I understand that running on the Jobmanager is easier if you want
>> to
>> > > launch a REST server directly. Here my question would be: does this
>> really
>> > > need to be solved inside of Flink or couldn't you start a webserver
>> next to
>> > > Flink? If we start using the Jobmanager as a REST server users will
>> expect
>> > > that e.g. it is highly available and can be load balanced and we
>> quickly
>> > > need to think about aspects that we never wanted to think about in the
>> > > context of a Flink Jobmanager.
>> > >
>> >
>> > I think the REST API is just for receiving commands targeting a running
>> > Flink job. If the job fails, the REST API would be useless.
>> >
>> >
>> > > So, can you elaborate a bit more, why a side-input/broadcast stream is
>> > >
>> > > a) more difficult
>> > > b) has vague semantics (To me semantics of a stream-stream seem
>> clearer
>> > > when it comes to out-of-orderness, late data, reprocessing or batch
>> > > execution mode.)
>> >
>> >
>> > I do agree that having the user defined control logic defined in the JM
>> > increases the chance of instability. In that case, we may think of other
>> > solutions and I am fully open to that. But the side-input / broadcast
>> > stream seems more like a bandaid instead of a carefully designed control
>> > plane mechanism.
>> >
>> > A decent control plane requires two-way communication, so information
>> can
>> > be reported / collected from the entity being controlled, and the
>> > coordinator / controller can send decisions or commands to the entities
>> > accordingly, just like our TM / JM communication. IIUC, this is not
>> > achievable with the existing side-input / broadcast stream as both of
>> them
>> > are one-way communication mechanisms. For instance, the example I gave
>> in
>> > my previous email seems not easily achievable with side-input /
>> broadcast
>> > streams: a single invalid pattern detected on a TM can be disabled
>> > elegantly globally without crashing the entire Flink job.
>> >
>> >
>> > > Cheers,
>> > >
>> > > Konstantin
>> > >
>> > >
>> > > On Tue, Dec 21, 2021 at 11:38 AM Becket Qin <[email protected]>
>> wrote:
>> > >
>> > > > Hi folks,
>> > > >
>> > > > I just finished reading the previous emails. It was a good
>> discussion.
>> > > Here
>> > > > are my two cents:
>> > > >
>> > > > *Is OperatorCoordinator a public API?*
>> > > > Regarding the OperatorCoordinator, although it is marked as internal
>> > > > interface at this point, I intended to make it a public interface
>> when
>> > > add
>> > > > that in FLIP-27. This is a powerful cross-subtask communication
>> mechanism
>> > > > that enables many use cases, Source / Sink / TF on Flink / CEP here
>> > > again.
>> > > > To my understanding, OC was marked as internal because we think it
>> is not
>> > > > stable enough yet. We may need to fortify the OperatorEvent delivery
>> > > > semantic a little bit so it works well with checkpoint in general.
>> > > >
>> > > > I think it is a valid concern that user code running in JM may cause
>> > > > instability. However, not providing this communication mechanism
>> only
>> > > makes
>> > > > a lot of use case even harder to implement. So it seems that having
>> the
>> > > OC
>> > > > exposed to end users brings more benefits than harm to Flink. At
>> the end
>> > > of
>> > > > the day, users have many ways to crash a Flink job if they do not
>> write
>> > > > proper code. So making those who know what they do happy seems more
>> > > > important here.
>> > > >
>> > > > *OperatorCoordinator V.S. side-input / broadcast stream*
>> > > > I think both of them can achieve the goal of dynamic patterns. The
>> main
>> > > > difference is the extensibility.
>> > > >
>> > > > OC is a 2-way communication mechanism, i.e. a subtask can also send
>> > > > OperatorEvent to the coordinator to report its owns status, so that
>> the
>> > > > coordinator can act accordingly. This is sometimes useful. For
>> example, a
>> > > > single invalid pattern can be disabled elegantly without crashing
>> the
>> > > > entire Flink job. In the future, if we allow users to send external
>> > > command
>> > > > to OC via JM, a default self-contained implementation can just
>> update the
>> > > > pattern via the REST API without external dependencies.
>> > > >
>> > > > Another reason I personally prefer OC is because it is an explicit
>> > > control
>> > > > plain mechanism, where as the side-input / broadcast stream has are
>> more
>> > > > vague semantic.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jiangjie (Becket) Qin
>> > > >
>> > > > On Tue, Dec 21, 2021 at 4:25 PM David Morávek <[email protected]>
>> wrote:
>> > > >
>> > > > > Hi Yunfeng,
>> > > > >
>> > > > > thanks for drafting this FLIP, this will be a great addition into
>> the
>> > > CEP
>> > > > > toolbox!
>> > > > >
>> > > > > Apart from running user code in JM, which want to avoid in
>> general, I'd
>> > > > > have one more another concern about using the OperatorCoordinator
>> and
>> > > > that
>> > > > > is re-processing of the historical data. Any thoughts about how
>> this
>> > > will
>> > > > > work with the OC?
>> > > > >
>> > > > > I have a slight feeling that a side-input (custom source /
>> operator +
>> > > > > broadcast) would a better fit for this case. This would simplify
>> the
>> > > > > consistency concerns (watermarks + pushback) and the
>> re-processing of
>> > > > > historical data.
>> > > > >
>> > > > > Best,
>> > > > > D.
>> > > > >
>> > > > >
>> > > > > On Tue, Dec 21, 2021 at 6:47 AM Nicholas Jiang <
>> > > [email protected]
>> > > > >
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Konstantin, Martijn
>> > > > > >
>> > > > > > Thanks for the detailed feedback in the discussion. What I
>> still have
>> > > > > left
>> > > > > > to answer/reply to:
>> > > > > >
>> > > > > > -- Martijn: Just to be sure, this indeed would mean that if for
>> > > > whatever
>> > > > > > reason the heartbeat timeout, it would crash the job, right?
>> > > > > >
>> > > > > > IMO, if for whatever reason the heartbeat timeout, it couldn't
>> check
>> > > > the
>> > > > > > PatternProcessor consistency between the OperatorCoordinator
>> and the
>> > > > > > subtasks so that the job would be crashed.
>> > > > > >
>> > > > > > -- Konstantin: What I was concerned about is that we basically
>> let
>> > > > users
>> > > > > > run a UserFunction in the OperatorCoordinator, which it does
>> not seem
>> > > > to
>> > > > > > have been designed for.
>> > > > > >
>> > > > > > In general, we have reached an agreement on the design of this
>> FLIP,
>> > > > but
>> > > > > > there are some concerns on the OperatorCoordinator, about
>> whether
>> > > > > basically
>> > > > > > let users run a UserFunction in the OperatorCoordinator is
>> designed
>> > > for
>> > > > > > OperatorCoordinator. We would like to invite Becket Qin who is
>> the
>> > > > author
>> > > > > > of OperatorCoordinator to help us to answer this concern.
>> > > > > >
>> > > > > > Best,
>> > > > > > Nicholas Jiang
>> > > > > >
>> > > > > >
>> > > > > > On 2021/12/20 10:07:14 Martijn Visser wrote:
>> > > > > > > Hi all,
>> > > > > > >
>> > > > > > > Really like the discussion on this topic moving forward. I
>> really
>> > > > think
>> > > > > > > this feature will be much appreciated by the Flink users.
>> What I
>> > > > still
>> > > > > > have
>> > > > > > > left to answer/reply to:
>> > > > > > >
>> > > > > > > -- Good point. If for whatever reason the different
>> taskmanagers
>> > > > can't
>> > > > > > get
>> > > > > > > the latest rule, the Operator Coordinator could send a
>> heartbeat to
>> > > > all
>> > > > > > > taskmanagers with the latest rules and check the heartbeat
>> response
>> > > > > from
>> > > > > > > all the taskmanagers whether the latest rules of the
>> taskmanager is
>> > > > > equal
>> > > > > > > to these of the Operator Coordinator.
>> > > > > > >
>> > > > > > > Just to be sure, this indeed would mean that if for whatever
>> reason
>> > > > the
>> > > > > > > heartbeat timeout, it would crash the job, right?
>> > > > > > >
>> > > > > > > -- We have consided about the solution mentioned above. In
>> this
>> > > > > > solution, I
>> > > > > > > have some questions about how to guarantee the consistency of
>> the
>> > > > rule
>> > > > > > > between each TaskManager. By having a coodinator in the
>> JobManager
>> > > to
>> > > > > > > centrally manage the latest rules, the latest rules of all
>> > > > TaskManagers
>> > > > > > are
>> > > > > > > consistent with those of the JobManager, so as to avoid the
>> > > > > > inconsistencies
>> > > > > > > that may be encountered in the above solution. Can you
>> introduce
>> > > how
>> > > > > this
>> > > > > > > solution guarantees the consistency of the rules?
>> > > > > > >
>> > > > > > > The consistency that we could guarantee was based on how
>> often each
>> > > > > > > TaskManager would do a refresh and how often we would accept a
>> > > > refresh
>> > > > > to
>> > > > > > > fail. We set the refresh time to a relatively short one (30
>> > > seconds)
>> > > > > and
>> > > > > > > maximum failures to 3. That meant that we could guarantee that
>> > > rules
>> > > > > > would
>> > > > > > > be updated in < 2 minutes or else the job would crash. That
>> was
>> > > > > > sufficient
>> > > > > > > for our use cases. This also really depends on how big your
>> cluster
>> > > > > is. I
>> > > > > > > can imagine that if you have a large scale cluster that you
>> want to
>> > > > > run,
>> > > > > > > you don't want to DDOS the backend system where you have your
>> rules
>> > > > > > stored.
>> > > > > > >
>> > > > > > > -- In summary, the current design is that JobManager tells all
>> > > > > > TaskManagers
>> > > > > > > the latest rules through OperatorCoodinator, and will
>> initiate a
>> > > > > > heartbeat
>> > > > > > > to check whether the latest rules on each TaskManager are
>> > > consistent.
>> > > > > We
>> > > > > > > will describe how to deal with the Failover scenario in more
>> detail
>> > > > on
>> > > > > > FLIP.
>> > > > > > >
>> > > > > > > Thanks for that. I think having the JobManager tell the
>> > > TaskManagers
>> > > > > the
>> > > > > > > applicable rules would indeed end up being the best design.
>> > > > > > >
>> > > > > > > -- about the concerns around consistency raised by Martijn: I
>> > > think a
>> > > > > lot
>> > > > > > > of those can be mitigated by using an event time timestamp
>> from
>> > > which
>> > > > > the
>> > > > > > > rules take effect. The reprocessing scenario, for example, is
>> > > covered
>> > > > > by
>> > > > > > > this. If a pattern processor should become active as soon as
>> > > > possible,
>> > > > > > > there will still be inconsistencies between Taskmanagers, but
>> "as
>> > > > soon
>> > > > > as
>> > > > > > > possible" is vague anyway, which is why I think that's ok.
>> > > > > > >
>> > > > > > > I think an event timestamp is indeed a really important one.
>> We
>> > > also
>> > > > > used
>> > > > > > > that in my previous role, with the ruleActivationTimestamp
>> compared
>> > > > to
>> > > > > > > eventtime (well, actually we used Kafka logAppend time because
>> > > > > > > eventtime wasn't always properly set so we used that time to
>> > > > overwrite
>> > > > > > the
>> > > > > > > eventtime from the event itself).
>> > > > > > >
>> > > > > > > Best regards,
>> > > > > > >
>> > > > > > > Martijn
>> > > > > > >
>> > > > > > > On Mon, 20 Dec 2021 at 09:08, Konstantin Knauf <
>> [email protected]>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi Nicholas, Hi Junfeng,
>> > > > > > > >
>> > > > > > > > about the concerns around consistency raised by Martijn: I
>> think
>> > > a
>> > > > > lot
>> > > > > > of
>> > > > > > > > those can be mitigated by using an event time timestamp from
>> > > which
>> > > > > the
>> > > > > > > > rules take effect. The reprocessing scenario, for example,
>> is
>> > > > covered
>> > > > > > by
>> > > > > > > > this. If a pattern processor should become active as soon as
>> > > > > possible,
>> > > > > > > > there will still be inconsistencies between Taskmanagers,
>> but "as
>> > > > > soon
>> > > > > > as
>> > > > > > > > possible" is vague anyway, which is why I think that's ok.
>> > > > > > > >
>> > > > > > > > about naming: The naming with "PatternProcessor" sounds
>> good to
>> > > me.
>> > > > > > Final
>> > > > > > > > nit: I would go for CEP#patternProccessors, which would be
>> > > > consistent
>> > > > > > with
>> > > > > > > > CEP#pattern.
>> > > > > > > >
>> > > > > > > > I am not sure about one of the rejected alternatives:
>> > > > > > > >
>> > > > > > > > > Have each subtask of an operator make the update on their
>> own
>> > > > > > > >
>> > > > > > > >    -
>> > > > > > > >
>> > > > > > > >    It is hard to achieve consistency.
>> > > > > > > >    -
>> > > > > > > >
>> > > > > > > >       Though the time interval that each subtask makes the
>> update
>> > > > can
>> > > > > > be
>> > > > > > > >       the same, the absolute time they make the update
>> might be
>> > > > > > different.
>> > > > > > > > For
>> > > > > > > >       example, one makes updates at 10:00, 10:05, etc, while
>> > > > another
>> > > > > > does
>> > > > > > > > it at
>> > > > > > > >       10:01, 10:06. In this case the subtasks might never
>> > > > processing
>> > > > > > data
>> > > > > > > > with
>> > > > > > > >       the same set of pattern processors.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > I would have thought that it is quite easy to poll for the
>> rules
>> > > > from
>> > > > > > each
>> > > > > > > > Subtask at *about *the same time. So, this alone does not
>> seem to
>> > > > be
>> > > > > > > > enough to rule out this option. I've looped in David
>> Moravek to
>> > > get
>> > > > > his
>> > > > > > > > opinion of the additional load imposed on the JM.
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > >
>> > > > > > > > Konstantin
>> > > > > > > >
>> > > > > > > > On Mon, Dec 20, 2021 at 4:06 AM Nicholas Jiang <
>> > > > > > [email protected]>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi Yue,
>> > > > > > > > >
>> > > > > > > > > Thanks for your feedback of the FLIP. I have addressed
>> your
>> > > > > > questions and
>> > > > > > > > > made a corresponding explanation as follows:
>> > > > > > > > >
>> > > > > > > > > -- About Pattern Updating. If we use
>> > > PatternProcessoerDiscoverer
>> > > > to
>> > > > > > > > update
>> > > > > > > > > the rules, will it increase the load of JM? For example,
>> if the
>> > > > > user
>> > > > > > > > wants
>> > > > > > > > > the updated rule to take effect immediately,, which means
>> that
>> > > we
>> > > > > > need to
>> > > > > > > > > set a shorter check interval or there is another scenario
>> when
>> > > > > users
>> > > > > > > > rarely
>> > > > > > > > > update the pattern, will the PatternProcessoerDiscoverer
>> be in
>> > > > most
>> > > > > > of
>> > > > > > > > the
>> > > > > > > > > time Do useless checks ? Will a lazy update mode could be
>> used,
>> > > > > > which the
>> > > > > > > > > pattern only be updated when triggered by the user, and do
>> > > > nothing
>> > > > > at
>> > > > > > > > other
>> > > > > > > > > times?
>> > > > > > > > >
>> > > > > > > > > PatternProcessoerDiscoverer is a user-defined interface to
>> > > > discover
>> > > > > > the
>> > > > > > > > > PatternProcessor updates. Periodically checking the
>> > > > > PatternProcessor
>> > > > > > in
>> > > > > > > > the
>> > > > > > > > > database is a implementation of the
>> PatternProcessoerDiscoverer
>> > > > > > > > interface,
>> > > > > > > > > which is that periodically querys all the PatternProcessor
>> > > table
>> > > > in
>> > > > > > > > certain
>> > > > > > > > > interval. This implementation indeeds has the useless
>> checks,
>> > > and
>> > > > > > could
>> > > > > > > > > directly integrates the changelog of the table. In
>> addition, in
>> > > > > > addition
>> > > > > > > > to
>> > > > > > > > > the implementation of periodically checking the database,
>> there
>> > > > are
>> > > > > > other
>> > > > > > > > > implementations such as the PatternProcessor that provides
>> > > > Restful
>> > > > > > > > services
>> > > > > > > > > to receive updates.
>> > > > > > > > >
>> > > > > > > > > --  I still have some confusion about how Key Generating
>> > > > Opertator
>> > > > > > and
>> > > > > > > > > CepOperator (Pattern Matching & Processing Operator) work
>> > > > together.
>> > > > > > If
>> > > > > > > > > there are N PatternProcessors, will the Key Generating
>> > > Opertator
>> > > > > > > > generate N
>> > > > > > > > > keyedStreams, and then N CepOperator would process each
>> Key
>> > > > > > separately ?
>> > > > > > > > Or
>> > > > > > > > > every CepOperator Task would process all patterns, if so,
>> does
>> > > > the
>> > > > > > key
>> > > > > > > > type
>> > > > > > > > > in each PatternProcessor need to be the same?
>> > > > > > > > >
>> > > > > > > > > Firstly the Pattern Matching & Processing Operator is not
>> the
>> > > > > > CepOperator
>> > > > > > > > > at present, because CepOperator mechanism is based on the
>> > > > NFAState.
>> > > > > > > > > Secondly if there are N PatternProcessors, the Key
>> Generating
>> > > > > > Opertator
>> > > > > > > > > combines all the keyedStreams with keyBy() operation,
>> thus the
>> > > > > > Pattern
>> > > > > > > > > Matching & Processing Operator would process all the
>> patterns.
>> > > In
>> > > > > > other
>> > > > > > > > > words, the KeySelector of the PatternProcessor is used
>> for the
>> > > > Key
>> > > > > > > > > Generating Opertator, and the Pattern and
>> > > PatternProceessFunction
>> > > > > of
>> > > > > > the
>> > > > > > > > > PatternProcessor are used for the Pattern Matching &
>> Processing
>> > > > > > Operator.
>> > > > > > > > > Lastly the key type in each PatternProcessor is the same,
>> > > > regarded
>> > > > > as
>> > > > > > > > > Object type.
>> > > > > > > > >
>> > > > > > > > > -- Maybe need to pay attention to it when implementing it
>> .If
>> > > > some
>> > > > > > > > Pattern
>> > > > > > > > > has been removed or updated, will the partially matched
>> results
>> > > > in
>> > > > > > > > > StateBackend would be clean up or We rely on state ttl to
>> clean
>> > > > up
>> > > > > > these
>> > > > > > > > > expired states.
>> > > > > > > > >
>> > > > > > > > > If certain Pattern has been removed or updated, the
>> partially
>> > > > > matched
>> > > > > > > > > results in StateBackend would be clean up until the next
>> > > > > checkpoint.
>> > > > > > The
>> > > > > > > > > partially matched result doesn't depend on the state ttl
>> of the
>> > > > > > > > > StateBackend.
>> > > > > > > > >
>> > > > > > > > > 4. Will the PatternProcessorManager keep all the active
>> > > > > > PatternProcessor
>> > > > > > > > > in memory? We have also Support Multiple Rule and Dynamic
>> Rule
>> > > > > > Changing.
>> > > > > > > > > But we are facing such a problem, some users’ usage
>> scenarios
>> > > are
>> > > > > > that
>> > > > > > > > they
>> > > > > > > > > want to have their own pattern for each user_id, which
>> means
>> > > that
>> > > > > > there
>> > > > > > > > > could be thousands of patterns, which would make the
>> > > performance
>> > > > of
>> > > > > > > > Pattern
>> > > > > > > > > Matching very poor. We are also trying to solve this
>> problem.
>> > > > > > > > >
>> > > > > > > > > The PatternProcessorManager keeps all the active
>> > > PatternProcessor
>> > > > > in
>> > > > > > > > > memory. For scenarios that they want to have their own
>> pattern
>> > > > for
>> > > > > > each
>> > > > > > > > > user_id, IMO, is it possible to reduce the fine-grained
>> pattern
>> > > > of
>> > > > > > > > > PatternProcessor to solve the performance problem of the
>> > > Pattern
>> > > > > > > > Matching,
>> > > > > > > > > for example, a pattern corresponds to a group of users?
>> The
>> > > > > scenarios
>> > > > > > > > > mentioned above need to be solved by case by case.
>> > > > > > > > >
>> > > > > > > > > Best,
>> > > > > > > > > Nicholas Jiang
>> > > > > > > > >
>> > > > > > > > > On 2021/12/17 11:43:10 yue ma wrote:
>> > > > > > > > > > Glad to see the Community's progress in Flink CEP. After
>> > > > reading
>> > > > > > this
>> > > > > > > > > Flip,
>> > > > > > > > > > I have few questions, would you please take a look  ?
>> > > > > > > > > >
>> > > > > > > > > > 1. About Pattern Updating. If we use
>> > > > PatternProcessoerDiscoverer
>> > > > > to
>> > > > > > > > > update
>> > > > > > > > > > the rules, will it increase the load of JM? For
>> example, if
>> > > the
>> > > > > > user
>> > > > > > > > > wants
>> > > > > > > > > > the updated rule to take effect immediately,, which
>> means
>> > > that
>> > > > we
>> > > > > > need
>> > > > > > > > to
>> > > > > > > > > > set a shorter check interval  or there is another
>> scenario
>> > > when
>> > > > > > users
>> > > > > > > > > > rarely update the pattern, will the
>> > > PatternProcessoerDiscoverer
>> > > > > be
>> > > > > > in
>> > > > > > > > > most
>> > > > > > > > > > of the time Do useless checks ? Will a lazy update mode
>> could
>> > > > be
>> > > > > > used,
>> > > > > > > > > > which the pattern only be updated when triggered by the
>> user,
>> > > > and
>> > > > > > do
>> > > > > > > > > > nothing at other times ?
>> > > > > > > > > >
>> > > > > > > > > > 2.   I still have some confusion about how Key
>> Generating
>> > > > > > Opertator and
>> > > > > > > > > > CepOperator (Pattern Matching & Processing Operator)
>> work
>> > > > > > together. If
>> > > > > > > > > > there are N PatternProcessors, will the Key Generating
>> > > > Opertator
>> > > > > > > > > generate N
>> > > > > > > > > > keyedStreams, and then N CepOperator would process each
>> Key
>> > > > > > separately
>> > > > > > > > ?
>> > > > > > > > > Or
>> > > > > > > > > > every CepOperator Task would process all patterns, if
>> so,
>> > > does
>> > > > > the
>> > > > > > key
>> > > > > > > > > type
>> > > > > > > > > > in each PatternProcessor need to be the same ?
>> > > > > > > > > >
>> > > > > > > > > > 3. Maybe need to pay attention to it when implementing
>> it .If
>> > > > > some
>> > > > > > > > > Pattern
>> > > > > > > > > > has been removed or updateed  ,will the partially
>> matched
>> > > > results
>> > > > > > in
>> > > > > > > > > > StateBackend would be clean up or We rely on state ttl
>> to
>> > > clean
>> > > > > up
>> > > > > > > > these
>> > > > > > > > > > expired states.
>> > > > > > > > > >
>> > > > > > > > > > 4. Will the PatternProcessorManager keep all the active
>> > > > > > > > PatternProcessor
>> > > > > > > > > in
>> > > > > > > > > > memory ? We have also Support Multiple Rule and Dynamic
>> Rule
>> > > > > > Changing .
>> > > > > > > > > > But we are facing such a problem, some users’ usage
>> scenarios
>> > > > are
>> > > > > > that
>> > > > > > > > > they
>> > > > > > > > > > want to have their own pattern for each user_id, which
>> means
>> > > > that
>> > > > > > there
>> > > > > > > > > > could be thousands of patterns, which would make the
>> > > > performance
>> > > > > of
>> > > > > > > > > Pattern
>> > > > > > > > > > Matching very poor. We are also trying to solve this
>> problem.
>> > > > > > > > > >
>> > > > > > > > > > Yunfeng Zhou <[email protected]>
>> 于2021年12月10日周五
>> > > > > 19:16写道：
>> > > > > > > > > >
>> > > > > > > > > > > Hi all，
>> > > > > > > > > > >
>> > > > > > > > > > > I'm opening this thread to propose the design to
>> support
>> > > > > multiple
>> > > > > > > > rule
>> > > > > > > > > &
>> > > > > > > > > > > dynamic rule changing in the Flink-CEP project, as
>> > > described
>> > > > in
>> > > > > > > > > FLIP-200
>> > > > > > > > > > > [1]
>> > > > > > > > > > > .
>> > > > > > > > > > >
>> > > > > > > > > > > Currently Flink CEP only supports having a single
>> pattern
>> > > > > inside
>> > > > > > a
>> > > > > > > > > > > CepOperator and does not support changing the pattern
>> > > > > > dynamically. In
>> > > > > > > > > order
>> > > > > > > > > > > to reduce resource consumption and to experience
>> shorter
>> > > > > downtime
>> > > > > > > > > during
>> > > > > > > > > > > pattern updates, there is a growing need in the
>> production
>> > > > > > > > environment
>> > > > > > > > > that
>> > > > > > > > > > > expects CEP to support having multiple patterns in one
>> > > > operator
>> > > > > > and
>> > > > > > > > to
>> > > > > > > > > > > support dynamically changing them. Therefore I
>> propose to
>> > > add
>> > > > > > certain
>> > > > > > > > > > > infrastructure as described in FLIP-200 to support
>> these
>> > > > > > > > > functionalities.
>> > > > > > > > > > >
>> > > > > > > > > > > Please feel free to reply to this email thread.
>> Looking
>> > > > forward
>> > > > > > to
>> > > > > > > > your
>> > > > > > > > > > > feedback!
>> > > > > > > > > > >
>> > > > > > > > > > > [1]
>> > > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195730308
>> > > > > > > > > > >
>> > > > > > > > > > > Best regards,
>> > > > > > > > > > >
>> > > > > > > > > > > Yunfeng
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > >
>> > > > > > > > Konstantin Knauf
>> > > > > > > >
>> > > > > > > > https://twitter.com/snntrable
>> > > > > > > >
>> > > > > > > > https://github.com/knaufk
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > > --
>> > >
>> > > Konstantin Knauf
>> > >
>> > > https://twitter.com/snntrable
>> > >
>> > > https://github.com/knaufk
>> > >
>> >
>>
>

Re: [DISCUSS] FLIP-200: Support Multiple Rule and Dynamic Rule Changing (Flink CEP)

Reply via email to