Thank you, Julian, for mentioning the anti-join. With its help, I managed to
solve our particular case as follows:
```
SELECT e.*
FROM events e
LEFT JOIN patterns p
  ON e.record_id = p.begin_record_id
-- Anti-join: keep only BEGIN events with no corresponding row in patterns.
WHERE e.pattern_val = 'BEGIN' AND p.begin_record_id IS NULL
```
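For anyone who wants to try the query on toy data, here is a minimal sketch using Python's built-in sqlite3 module. The column layout, the sample rows, and the helper name `unmatched_begins` are assumptions made for illustration only; in particular, I assume `patterns` holds one row per completed BEGIN/END match. It checks the batch semantics of the anti-join and says nothing about how Flink would evaluate it on an unbounded stream.

```python
import sqlite3

# The anti-join from this message, embedded as-is.
ANTI_JOIN = """
SELECT e.*
FROM events e
LEFT JOIN patterns p
  ON e.record_id = p.begin_record_id
WHERE e.pattern_val = 'BEGIN' AND p.begin_record_id IS NULL
"""


def unmatched_begins(events, patterns):
    """Return BEGIN events with no row in patterns (hypothetical toy schema)."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: only the columns the query touches, plus an END id.
    conn.execute("CREATE TABLE events (record_id INTEGER, pattern_val TEXT)")
    conn.execute(
        "CREATE TABLE patterns (begin_record_id INTEGER, end_record_id INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?)", events)
    conn.executemany("INSERT INTO patterns VALUES (?, ?)", patterns)
    # ORDER BY is added only to make the toy output deterministic.
    rows = conn.execute(ANTI_JOIN + " ORDER BY e.record_id").fetchall()
    conn.close()
    return rows


# Made-up sample data: BEGIN events 1 and 4 were matched, 3 was not.
events = [(1, "BEGIN"), (2, "END"), (3, "BEGIN"), (4, "BEGIN"), (5, "END")]
patterns = [(1, 2), (4, 5)]
print(unmatched_begins(events, patterns))  # [(3, 'BEGIN')]
```

Only the BEGIN event without a completed match comes back, which is the `BEGIN !END` case.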
However, I suspect that this approach will fail for patterns more complicated
than `BEGIN !END`. For example, determining on which event the pattern
`A B{1,N} A{1,N} B` timed out does not seem feasible with an anti-join.
Moreover, this way of proceeding feels like a workaround for the limitations
of MATCH_RECOGNIZE in dealing with absent events. I can’t think of a way to
solve these cases generically with the existing syntax, whereas the proposed
pattern extensions would provide one.
With regards,
Kosma
> On 22 Sep 2020, at 20:29, Julian Hyde <[email protected]> wrote:
>
> Is there a better way?
>
> I'm an idealist with regard to streaming SQL semantics, and I'm going
> to make the 'slippery slope' argument that if we add a TIMEOUT
> parameter to MATCH_RECOGNIZE, won't we also need to add it to GROUP BY
> and JOIN? (Because those are also "blocking" operators.)
>
> Maybe JOIN and GROUP BY are simpler because (absent retractions) they
> are monotonic. If more data arrives, it will not cause rows to
> disappear from your result. So, maybe anti-join is the best
> comparison. How does Flink deal with, say "show me all orders from
> customers who have not made a product return in the last 3 months"?
> You'd need a timeout on the PRODUCT_RETURNS stream, right?
>
> My hunch is that Flink can express these semantics without extending
> the syntax of JOIN, and if so, we could use the same approach to make
> MATCH_RECOGNIZE work with late data.
>
> Julian
>
> On Mon, Sep 21, 2020 at 12:05 AM Kosma Grochowski
> <[email protected]> wrote:
>>
>> Hi Jark,
>>
>> Thank you for your e-mail. I agree, let's engage all interested parties in
>> this discussion - I'm writing this e-mail to both Flink and Calcite dev
>> mailing lists.
>>
>> I'll repeat myself to present the proposal to the Calcite community.
>>
>> I would like to propose an enrichment of existing Flink SQL MATCH_RECOGNIZE
>> syntax to cover for the case of the absence of an event. Such an enrichment
>> would help our company solve a business case containing timed-out patterns
>> handling. An example of usage of such a clause from Flink training exercises
>> could be a task of identification of taxi rides with a START event that is
>> not followed by an END event within two hours. Currently, a solution to such
>> a task could be achieved with the use of CEP and a timeout handler. However,
>> as far as I know, it is impossible to take advantage of Flink SQL syntax for
>> this task.
>>
>> I can think of two ways for such a feature to be incorporated into existing
>> MATCH_RECOGNIZE syntax:
>> - In analogy to CEP, a keyword could be added which would determine, if
>> timed out matches should be dropped altogether or available either through
>> side output or main output. SQL usage could be similar to the current WITHIN
>> clause, e.g. "PATTERN (A B C) TIMEOUT INTERVAL '30' SECOND" would output
>> partially matched patterns 30 seconds after A event appearance.
>>
>> - Add possibility to define absence of event inside pattern definition - for
>> example "PATTERN (A B !C) WITHIN INTERVAL '30' SECOND" would output
>> partially matched patterns with the occurrence of A and B event 30 seconds
>> after A event appearance.
>>
>> In our company we did some basic testing of this concept - we modified
>> existing MatchCodeGenerator to add processTimedOutMatch function based on a
>> boolean trigger and tested it against the aforementioned business case
>> containing timed-out patterns handling.
>>
>> I'm interested to hear your thoughts about how we could help Flink SQL be
>> able to express these kinds of cases.
>>
>> With regards,
>> Kosma Grochowski
>>
>>
>>
>>> On 21 Sep 2020, at 05:12, Jark Wu <[email protected]> wrote:
>>>
>>> Hi Kosma,
>>>
>>> Thanks for the proposal. I like it and we also have supported similar
>>> syntax in our company.
>>> The problem is that Flink SQL leverages Calcite as the query parser, so if
>>> we want to support this syntax, we may have to push this syntax back to the
>>> Calcite community.
>>> Besides, the SQL standard doesn't define the timeout syntax for MATCH
>>> RECOGNIZE. So we have to extend the standard and this is usually not
>>> trivial.
>>>
>>> So I think it would be better to have a joint discussion with the Calcite
>>> and Flink community together. What do you think?
>>>
>>> Best,
>>> Jark
>>>
>>>
>>>
>>>
>>>
>>> On Fri, 18 Sep 2020 at 22:48, Kosma Grochowski <
>>> [email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I would like to propose an enrichment of existing Flink SQL
>>>> MATCH_RECOGNIZE syntax to cover for the case of the absence of an event.
>>>> Such an enrichment would help our company solve a business case containing
>>>> timed-out patterns handling. An example of usage of such a clause from
>>>> Flink training exercises could be a task of identification of taxi rides
>>>> with a START event that is not followed by an END event within two hours.
>>>> Currently, a solution to such a task could be achieved with the use of CEP
>>>> and a timeout handler. However, as far as I know, it is impossible to take
>>>> advantage of Flink SQL syntax for this task.
>>>>
>>>> I can think of two ways for such a feature to be incorporated into
>>>> existing MATCH_RECOGNIZE syntax:
>>>> - In analogy to CEP, a keyword could be added which would determine, if
>>>> timed out matches should be dropped altogether or available either through
>>>> side output or main output. SQL usage could be similar to the current
>>>> WITHIN clause, e.g. "PATTERN (A B C) TIMEOUT INTERVAL '30' SECOND" would
>>>> output partially matched patterns 30 seconds after A event appearance.
>>>>
>>>> - Add possibility to define absence of event inside pattern definition -
>>>> for example "PATTERN (A B !C) WITHIN INTERVAL '30' SECOND" would output
>>>> partially matched patterns with the occurrence of A and B event 30 seconds
>>>> after A event appearance.
>>>>
>>>> In our company we did some basic testing of this concept - we modified
>>>> existing MatchCodeGenerator to add processTimedOutMatch function based on a
>>>> boolean trigger and tested it against the aforementioned business case
>>>> containing timed-out patterns handling.
>>>>
>>>>
>>>> I'm interested to hear your thoughts about how we could help Flink SQL be
>>>> able to express these kinds of cases.
>>>>
>>>> With regards,
>>>> Kosma Grochowski
>>>>
>>>>
>>>>
>>>>
>>