[ 
https://issues.apache.org/jira/browse/CALCITE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157391#comment-16157391
 ] 

Julian Hyde commented on CALCITE-1935:
--------------------------------------

There is an old result that a regexp is equivalent to an NFA (or a DFA, for 
that matter). See https://en.wikipedia.org/wiki/Regular_language. So, whatever 
sequence of events your NFA recognizes, there is an equivalent regular 
expression that matches precisely the same set of events.

In other words, you can save yourself the effort of building an NFA by hacking 
the one inside the Java regexp implementation.
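To make that suggestion concrete, here is a minimal sketch (not Calcite code; all class and method names are illustrative). It assumes the common case where each DEFINE predicate compares a row only with its predecessor, as in the query quoted below: each row is classified into a single character, and the PATTERN clause then becomes an ordinary java.util.regex expression over that character string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatchSketch {
  /** Classify each row relative to its predecessor: 'U' if units_sold
   * went up, 'D' if down, 'F' if flat; the first row gets 'S' because
   * it has no predecessor. */
  static String classify(int[] unitsSold) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < unitsSold.length; i++) {
      if (i == 0) sb.append('S');
      else if (unitsSold[i] > unitsSold[i - 1]) sb.append('U');
      else if (unitsSold[i] < unitsSold[i - 1]) sb.append('D');
      else sb.append('F');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // units_sold per day, one partition, already ordered by tstamp
    int[] unitsSold = {10, 12, 15, 15, 11, 8, 9, 13, 7};
    String symbols = classify(unitsSold); // "SUUFDDUUD"

    // PATTERN (STRT UP+ FLAT* DOWN+), where STRT matches any row,
    // becomes the regex ".U+F*D+" over the symbol string.
    Matcher m = Pattern.compile(".U+F*D+").matcher(symbols);
    while (m.find()) {
      System.out.println("match rows " + m.start() + ".." + (m.end() - 1));
    }
  }
}
```

This sidesteps predicates that look arbitrarily far back (e.g. comparing with FIRST or an aggregate), but for prefix-independent, previous-row predicates it reuses the NFA inside the JDK rather than building a new one.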

I can see how an NFA would be much more efficient than other approaches. But if 
efficiency were not an issue, would you still build it?

I strongly believe that you should build a very inefficient implementation 
first. O(n ^ 5) would be fine. Then, when we have 20 test queries running, 
start on the second implementation, which will be efficient and perhaps also 
able to run on unbounded input.
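For the inefficient-first approach, the crudest possible cut is to buffer all rows and test every (start, end) window against the pattern directly, which is roughly O(n^3) for a linear-time window check. A hypothetical sketch for a simplified pattern STRT UP+ DOWN+ (names are illustrative, not Calcite API):

```java
public class NaiveMatchSketch {
  /** Does rows[start..end] match STRT UP+ DOWN+?  rows[start] is STRT
   * and always matches; UP and DOWN compare a row with its
   * predecessor, mirroring the DEFINE clauses in the example query. */
  static boolean matches(int[] units, int start, int end) {
    int i = start + 1;
    int ups = 0;
    while (i <= end && units[i] > units[i - 1]) { i++; ups++; }
    if (ups == 0) return false;          // UP+ needs at least one row
    int downs = 0;
    while (i <= end && units[i] < units[i - 1]) { i++; downs++; }
    return downs > 0 && i > end;         // DOWN+, and nothing left over
  }

  public static void main(String[] args) {
    int[] units = {10, 12, 15, 11, 8, 9, 13, 7};
    // Enumerate every window: O(n^2) windows, O(n) check each.
    for (int s = 0; s < units.length; s++) {
      for (int e = s; e < units.length; e++) {
        if (matches(units, s, e)) {
          System.out.println("candidate match " + s + ".." + e);
        }
      }
    }
  }
}
```

A real implementation would still need the AFTER MATCH SKIP logic to pick one match per overlapping group, but a brute-force enumerator like this is enough to pin down expected results for a test corpus.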

> Reference implementation for MATCH_RECOGNIZE
> --------------------------------------------
>
>                 Key: CALCITE-1935
>                 URL: https://issues.apache.org/jira/browse/CALCITE-1935
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Assignee: Julian Hyde
>
> We now have comprehensive support for parsing and validating MATCH_RECOGNIZE 
> queries (see CALCITE-1570 and sub-tasks) but we cannot execute them. I know 
> the purpose of this work is to do CEP within Flink, but a reference 
> implementation that works on non-streaming data would be valuable.
>
> I propose that we add a class EnumerableMatch that can generate Java code to 
> evaluate MATCH_RECOGNIZE queries on Enumerable data. It does not need to be 
> efficient. I don't mind if it (say) buffers all the data in memory and makes 
> O(n ^ 3) passes over it. People can make it more efficient over time.
>
> When we have a reference implementation, people can start playing with this 
> feature. And we can start building a corpus of data sets, queries, and their 
> expected result. The Flink implementation will be able to test against those 
> same queries, and should give the same results, even though Flink will be 
> reading streaming data.
>
> Let's create {{match.iq}} with the following query based on 
> https://oracle-base.com/articles/12c/pattern-matching-in-oracle-database-12cr1:
> {code}
> !set outputformat mysql
> !use match
> SELECT *
> FROM sales_history MATCH_RECOGNIZE (
>          PARTITION BY product
>          ORDER BY tstamp
>          MEASURES  STRT.tstamp AS start_tstamp,
>                    LAST(UP.tstamp) AS peak_tstamp,
>                    LAST(DOWN.tstamp) AS end_tstamp,
>                    MATCH_NUMBER() AS mno
>          ONE ROW PER MATCH
>          AFTER MATCH SKIP TO LAST DOWN
>          PATTERN (STRT UP+ FLAT* DOWN+)
>          DEFINE
>            UP AS UP.units_sold > PREV(UP.units_sold),
>            FLAT AS FLAT.units_sold = PREV(FLAT.units_sold),
>            DOWN AS DOWN.units_sold < PREV(DOWN.units_sold)
>        ) MR
> ORDER BY MR.product, MR.start_tstamp;
> PRODUCT    START_TSTAM PEAK_TSTAMP END_TSTAMP         MNO
> ---------- ----------- ----------- ----------- ----------
> TWINKIES   01-OCT-2014 03-OCT-2014 06-OCT-2014          1
> TWINKIES   06-OCT-2014 08-OCT-2014 09-OCT-2014          2
> TWINKIES   09-OCT-2014 13-OCT-2014 16-OCT-2014          3
> TWINKIES   16-OCT-2014 18-OCT-2014 20-OCT-2014          4
> 4 rows selected.
> !ok
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
