Thank you, Julian, for mentioning the anti-join. With its help, I managed to 
solve our particular case as follows:

```
-- Anti-join: keep BEGIN events that have no matching row in patterns.
SELECT e.*
FROM events e
LEFT JOIN patterns p
  ON e.record_id = p.begin_record_id
WHERE e.pattern_val = 'BEGIN'
  AND p.begin_record_id IS NULL
```
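
As a quick sanity check, the query above can be run against toy data. The sketch below uses Python's built-in sqlite3 module rather than Flink, so it only illustrates the anti-join logic, not streaming behaviour. It assumes that `patterns` holds one row per completed BEGIN ... END match; the `end_record_id` column and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory database standing in for the two tables used in the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (record_id INTEGER PRIMARY KEY, pattern_val TEXT);
    -- Assumption: one row per completed BEGIN ... END match.
    -- end_record_id is a hypothetical column, not taken from the original schema.
    CREATE TABLE patterns (begin_record_id INTEGER, end_record_id INTEGER);

    INSERT INTO events VALUES
        (1, 'BEGIN'), (2, 'END'),   -- completed match
        (3, 'BEGIN'),               -- BEGIN that never got an END
        (4, 'BEGIN'), (5, 'END');   -- completed match
    INSERT INTO patterns VALUES (1, 2), (4, 5);
""")

# Same anti-join as above: BEGIN events with no row in patterns.
unmatched = conn.execute("""
    SELECT e.*
    FROM events e
    LEFT JOIN patterns p
      ON e.record_id = p.begin_record_id
    WHERE e.pattern_val = 'BEGIN'
      AND p.begin_record_id IS NULL
    ORDER BY e.record_id
""").fetchall()

print(unmatched)  # [(3, 'BEGIN')]
```

Only the BEGIN event with `record_id` 3 is returned, since it is the only BEGIN without a corresponding row in `patterns`.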

However, I suspect that this approach will fail for patterns more complicated 
than `BEGIN !END`. For example, determining on which event the pattern 
`A B{1,N} A{1,N} B` timed out does not seem feasible with an anti-join. 
Moreover, this way of proceeding feels like a workaround for the limitations 
of MATCH_RECOGNIZE in dealing with absent events. I can't think of a way to 
solve these cases generically with the existing syntax, whereas the proposed 
pattern extensions would make that possible.


With regards,
Kosma



> On 22 Sep 2020, at 20:29, Julian Hyde <jh...@apache.org> wrote:
> 
> Is there a better way?
> 
> I'm an idealist with regard to streaming SQL semantics, and I'm going
> to make the 'slippery slope' argument that if we add a TIMEOUT
> parameter to MATCH_RECOGNIZE, won't we also need to add it to GROUP BY
> and JOIN? (Because those are also "blocking" operators.)
> 
> Maybe JOIN and GROUP BY are simpler because (absent retractions) they
> are monotonic. If more data arrives, it will not cause rows to
> disappear from your result. So, maybe anti-join is the best
> comparison. How does Flink deal with, say "show me all orders from
> customers who have not made a product return in the last 3 months"?
> You'd need a timeout on the PRODUCT_RETURNS stream, right?
> 
> My hunch is that Flink can express these semantics without extending
> the syntax of JOIN, and if so, we could use the same approach to make
> MATCH_RECOGNIZE work with late data.
> 
> Julian
> 
> On Mon, Sep 21, 2020 at 12:05 AM Kosma Grochowski
> <kosma.grochow...@getindata.com> wrote:
>> 
>> Hi Jark,
>> 
>> Thank you for your e-mail. I agree, let's engage all interested parties in 
>> this discussion - I'm writing this e-mail to both Flink and Calcite dev 
>> mailing lists.
>> 
>> I'll repeat myself to present the proposal to the Calcite community.
>> 
>> I would like to propose an enrichment of existing Flink SQL MATCH_RECOGNIZE 
>> syntax to cover for the case of the absence of an event. Such an enrichment 
>> would help our company solve a business case containing timed-out patterns 
>> handling. An example of usage of such a clause from Flink training exercises 
>> could be a task of identification of taxi rides with a START event that is 
>> not followed by an END event within two hours. Currently, a solution to such 
>> a task could be achieved with the use of CEP and a timeout handler. However, 
>> as far as I know, it is impossible to take advantage of Flink SQL syntax for 
>> this task.
>> 
>> I can think of two ways for such a feature to be incorporated into existing 
>> MATCH_RECOGNIZE syntax:
>> - In analogy to CEP, a keyword could be added which would determine, if 
>> timed out matches should be dropped altogether or available either through 
>> side output or main output. SQL usage could be similar to the current WITHIN 
>> clause, f.e. "PATTERN (A B C) TIMEOUT INTERVAL '30' SECOND" would output 
>> partially matched patterns 30 seconds after A event appearance.
>> 
>> - Add possibility to define absence of event inside pattern definition - for 
>> example "PATTERN (A B !C) WITHIN INTERVAL '30' SECOND" would output 
>> partially matched patterns with the occurrence of A and B event 30 seconds 
>> after A event appearance.
>> 
>> In our company we did some basic testing of this concept - we modified 
>> existing MatchCodeGenerator to add processTimedOutMatch function based on a 
>> boolean trigger and tested it against the aforementioned business case 
>> containing timed-out patterns handling.
>> 
>> I'm interested to hear your thoughts about how we could help Flink SQL be 
>> able to express these kinds of cases.
>> 
>> With regards,
>> Kosma Grochowski
>> 
>> 
>> 
>>> On 21 Sep 2020, at 05:12, Jark Wu <imj...@gmail.com> wrote:
>>> 
>>> Hi Kosma,
>>> 
>>> Thanks for the proposal. I like it and we also have supported similar
>>> syntax in our company.
>>> The problem is that Flink SQL leverages Calcite as the query parser, so if
>>> we want to support this syntax, we may have to push this syntax back to the
>>> Calcite community.
>>> Besides, the SQL standard doesn't define the timeout syntax for MATCH
>>> RECOGNIZE. So we have to extend the standard and this is usually not
>>> trivial.
>>> 
>>> So I think it would be better to have a joint discussion with the Calcite
>>> and Flink community together. What do you think?
>>> 
>>> Best,
>>> Jark
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Fri, 18 Sep 2020 at 22:48, Kosma Grochowski <
>>> kosma.grochow...@getindata.com> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I would like to propose an enrichment of existing Flink SQL
>>>> MATCH_RECOGNIZE syntax to cover for the case of the absence of an event.
>>>> Such an enrichment would help our company solve a business case containing
>>>> timed-out patterns handling. An example of usage of such a clause from
>>>> Flink training exercises could be a task of identification of taxi rides
>>>> with a START event that is not followed by an END event within two hours.
>>>> Currently, a solution to such a task could be achieved with the use of CEP
>>>> and a timeout handler. However, as far as I know, it is impossible to take
>>>> advantage of Flink SQL syntax for this task.
>>>> 
>>>> I can think of two ways for such a feature to be incorporated into
>>>> existing MATCH_RECOGNIZE syntax:
>>>> - In analogy to CEP, a keyword could be added which would determine, if
>>>> timed out matches should be dropped altogether or available either through
>>>> side output or main output. SQL usage could be similar to the current
>>>> WITHIN clause, f.e. "PATTERN (A B C) TIMEOUT INTERVAL '30' SECOND" would
>>>> output partially matched patterns 30 seconds after A event appearance.
>>>> 
>>>> - Add possibility to define absence of event inside pattern definition -
>>>> for example "PATTERN (A B !C) WITHIN INTERVAL '30' SECOND" would output
>>>> partially matched patterns with the occurrence of A and B event 30 seconds
>>>> after A event appearance.
>>>> 
>>>> In our company we did some basic testing of this concept - we modified
>>>> existing MatchCodeGenerator to add processTimedOutMatch function based on a
>>>> boolean trigger and tested it against the aforementioned business case
>>>> containing timed-out patterns handling.
>>>> 
>>>> 
>>>> I'm interested to hear your thoughts about how we could help Flink SQL be
>>>> able to express these kinds of cases.
>>>> 
>>>> With regards,
>>>> Kosma Grochowski
>>>> 
>>>> 
>>>> 
>>>> 
>> 
