Brian Lehman created FLINK-33538:
------------------------------------

             Summary: Create a "best match" option for CEP when using the 
optional operator in the Pattern API 
                 Key: FLINK-33538
                 URL: https://issues.apache.org/jira/browse/FLINK-33538
             Project: Flink
          Issue Type: Improvement
          Components: Library / CEP
            Reporter: Brian Lehman


I’ll try to detail my issue and the initial experimentation I’ve done in the 
CEP source code to show the potential of making this nearly as fast as having 
no optional events in the pattern.

 

Using Flink CEP I have implemented a pattern that detects a sequence of events. 
Each event has internal attributes that are checked and inter-event time deltas 
are also checked as part of the pattern.

 

When I require all events in my sequence (say 10 elements long) Flink CEP works 
well and is super fast at detecting the “perfect” sequences. Should one or more 
of the events not get recorded, I would still like to detect the partial 
sequence. Since I don’t know which events might be missing I must make all of 
the events optional. While this worked, it was significantly slower, to the 
point that it is an unworkable solution.

 

CEP is set to alert as fast as possible – so when everything is optional, once 
it passes a single event the path to final in ComputationState is immediately 
found and the potentialMatch is emitted. If a perfect match is coming, I will 
not find the pattern unless the skip strategy is set to noSkip(). Because all 
are optional, in the instance of a perfect match, CEP wants to emit all 2^10 
possible matches when noSkip() is used. This is what causes the extreme 
performance drop off.

 

I would like to add an option to Flink CEP that allows the “best match” when 
the optional() attribute is used in the Pattern API. The default can still be 
“fastest match” which would operate as it does today. In the “best match” 
scenario, the potentialMatches would be held back (not emitted) until the full 
pattern time has passed.

 

I have made some modifications to the source code (primarily 
org.apache.flink.cep.nfa.NFA) that show this can work and be nearly as fast as 
the perfect match option. My basic strategy is the following:
 * Once a longer match is found, discard all of the shorter length sub-matches 
from partialMatches and potentialMatches.
 * Eg –  [E1, $endState$]
 * [E1, E2, $endState$] - discard the [E1, $endState$] and [E2, $endState$] and 
those partialMatch paths as well
 * [E1, E2, E3, $endState$] – discard the [E1, E2, $endState$], etc.


 * DeweyNumber allows a way to quickly assess match lengths, by comparing 
DeweyNumber.lengths.
 * DeweyNumber provides a way to assess matches, with its versioning – matches 
tend to use a version of zero (eg. 1.0.0.0.0.0 is a perfect match to the first 
6 events of the pattern)
 * Delay emitting the “best match” until the options are exhausted, eliminating 
the shorter matches as quickly as possible.

 

I’m at a place in my modifications where it would be very beneficial to work 
with an expert that would understand how best to accomplish this without 
impacting current functionality/performance. I’m happy to collaborate and share 
my work if that helps. My current modifications are a combination of mods and 
logs to allow me to see what is going on internally in the CEP processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to