Re: MATCH_RECOGNIZE

Julian Feinauer Fri, 28 Dec 2018 22:27:13 -0800

Hi Julian,

I see your argument with the DFA. The reason I used it is get a (possibly 
inefficient) working solution and it helped me to reduce some code complexity, 
see the reduction of code in the Matcher.matchWithSymbol Method [1].
But, I think we can easily factor out the code for DFA creation and keep only 
the epsilon removal.
I can do this if you want.


To your other point. I did never intention to step on your feet with this and I 
really have to excuse for that. My only intention was to help get the 
MATCH_RECOGNIZE implementation working as offered in recent conversations. So I 
suggest that I stop working on this and let you go on. If you need some of my 
code I'll prepare the respective parts for a PR to your branch.

Let me know if there are other things I can do to help you with the 
MATCH_RECOGNIZE implementation until then I'll stop my efforts on this topic.

JulianF

[1] 
https://github.com/JulianFeinauer/calcite/blob/1dc707278f2e23e712c29b495a67084689cf825e/core/src/main/java/org/apache/calcite/runtime/Matcher.java

Am 29.12.18, 01:30 schrieb "Julian Hyde" <[email protected]>:

    I hesitated to make the automaton deterministic because the DFA can have 
exponentially more states than its corresponding NFA, and I was concerned that 
this worst case might occur in real-world queries. I could imagine a query with 
20 symbols and 1 million states, or 30 symbols and 1 billion states. Are you 
confident that the DFA will always have a reasonable size?
    
    Whether or not we go to DFA, removing epsilon transitions does seem 
worthwhile. I think that would cause only a small increase in the number of 
states.
    
    I’m going to stop working on this code, until you reach a stopping point. 
There’s no point in us treading on each other’s feet. It’s a shame, because I 
was looking forward to working on this code over the Xmas break.
    
    Julian
    
    
    > On Dec 28, 2018, at 3:02 PM, Julian Feinauer 
<[email protected]> wrote:
    > 
    > Hi Julian,
    > 
    > as it got really confusing for me with the eps-NFA (and all the 
concurrent partial matches in the matcher) I added a class DFA that transforms 
the epsilon-NFA from an automaton to a DFA and reimplemented the Matcher based 
on it.
    > I think it is running now but it fails on tests that contain a "repeat" 
statement (X{a,b}) in AutomatonTest.
    > The constructed DFA from these statements is wrong.
    > Do you have a reference of the transformation in your Thompson 
construction? I found nothing on quick googling.
    > I would like to check the implementation in the AutomatonBuilder:202 ff 
as I did not find a Bug in the Matcher or the DFA code.
    > Otherwise I would try to check it by expanding the repeat with symbols, 
ors and optionals.
    > 
    > I also fixed the coding style (sorry for that, I totally missed that) so 
it should be better to review my code now (as of commit 
https://github.com/julianhyde/calcite/pull/15/commits/358ca1c5928b57cc96c8b39be8d017872d870dcf
 ).
    > 
    > Best
    > JulianF
    > 
    > Am 28.12.18, 02:19 schrieb "Julian Hyde" <[email protected]>:
    > 
    >    I think we should get one example working end-to-end before moving to 
the next. (By “example” I mean a SQL query on a standard data set that 
exercises one new feature, say FINAL.) Right now nothing works end-to-end 
because we don’t have the basic code generation working.
    > 
    >    I agree that assigning symbols to matched rows is a hard problem. I 
think the best approach is to first figure out whether there is a match (that’s 
what Automaton does) and then, in a second pass, assign a symbol to each row. 
The second pass might be significantly slower, but only occurs less often. 
Also, AFAICT, symbol assignment is only required for the CLASSIFIER() function. 
So I was going to defer that task.
    > 
    >    Yes, I am actively working on this code. I see that your branch has 
significant changes because you use a different coding style (e.g. different 
indentation). Please change your code back to the existing style. There is no 
reason to make the task even more difficult than it already is.
    > 
    >    Julian
    > 
    > 
    >> On Dec 27, 2018, at 1:19 PM, Julian Feinauer 
<[email protected]> wrote:
    >> 
    >> Hi Julian,
    >> 
    >> regarding "^" and "$" it seems like Zhiqiang already introduced the 
fields strictStart and strictEnd in org.apache.calcite.rel.core.Match. But I 
agree with you and already had the same idea. And I'll go over to you last 
commit to start my branch off.
    >> 
    >> I made some progress in my branch [1]. I get it to compile and I get the 
test `JdbcTest.testMatch` to run and to fail (but no longer throw an exception, 
at least).
    >> I fixed several things at several places and I think the code generation 
is now working (NOT working good) for Matcher and Emitter.
    >> But there are “crucial” points where I’d like to have your advice (or 
someone else familiar with these topics):
    >> 
    >> First, I’m unsure how the FINAL function should be implemented (it’s no 
regular operator and I did not find any reference on how to deal with this) so 
I “shortcutet” it by a reference to the ABS function which is “noop” in the 
test case, see RexImplTable:385.
    >> 
    >> I also have no real idea about the implementation of PREV / LAST 
Operators. I think there are some similarities to Window Aggregates and the 
PRECEDING / FOLLOWING operators, like RexWindowBound.
    >> 
    >> But currently I started working on a refactoring of the Matcher. 
Currently it only returns the rows that matched but not the respective symbols 
the rows where matched to. They are necessary for the emitter. I'm unsure 
whether to keep it based on an NFA or it is easier with a DFA. 
    >> 
    >> Before I continue and dig more through the code base it would be good 
for me to have some kind of feedback whether I’m going in the right direction 
and the things I do are of any value or if I misunderstood or misinterpreted 
some parts.
    >> 
    >> JulianF
    >> 
    >> PS.: Are you actively working on the branch? We should synchronize to 
avoid duplicate work.
    >> 
    >> [1] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize
    >> 
    >> Am 27.12.18, 21:21 schrieb "Julian Hyde" <[email protected]>:
    >> 
    >>   I think you can implement “^” by adding a special BEGIN state to the 
automaton. Each automaton should be in this state on creation, and there is no 
inbound transition (i.e. no way to get back into this state).
    >> 
    >>   And you can implement “$” by adding a special end-of-data symbol (you 
might as well call it “$”) that is sent to each partition’s automaton when the 
input ends.
    >> 
    >>   These seem to be elegant solutions because most of the work is in 
Pattern and Automaton, and can be unit-tested in AutomatonTest. Just a little 
extra plumbing needs to be added to the runtime in order to use it.
    >> 
    >>   As you have noticed my branch 
https://github.com/julianhyde/calcite/tree/1935-match-recognize/ 
<https://github.com/julianhyde/calcite/tree/1935-match-recognize/> is broken as 
of the latest commit. Consider starting your branch from the previous commit 
https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7
 
<https://github.com/julianhyde/calcite/commit/ea20e84c2d0cf636d2279d182be6df2ef65b67d7>.
 We can sync up when my branch is working.
    >> 
    >>   Julian
    >> 
    >> 
    >>> On Dec 26, 2018, at 6:44 AM, Julian Feinauer 
<[email protected]> wrote:
    >>> 
    >>> Hi Julian,
    >>> 
    >>> I used [1] as reference. Anchors are explicitly stated as part of the 
syntax and explained as:
    >>> 
    >>>> Anchors work in terms of positions rather than rows. They match a 
position either at the start or end of a partition.
    >>>>  ^ matches the position before the first row in the partition.
    >>>>  $ matches the position after the last row in the partition.
    >>>> As an example, PATTERN (^A+$) will match only if all rows in a 
partition satisfy the condition for A. The resulting match spans the entire 
partition.
    >>> 
    >>> Regarding patterns, I think it should not be a big change, as the 
anchors are defined with respect to partition boundaries. So technically they 
do not have to see "beyond" boundaries but should simply "see" boundaries.
    >>> So all we need should be an "outside partition" state which CAN be used 
as starting or ending state (basically symbols "^" and "$" should reference 
that).
    >>> 
    >>> I'll see if I find a solution based on your code... I'll do the work in 
my branch [2] based on your branch [3].
    >>> 
    >>> Best
    >>> JulianF
    >>> 
    >>> [1] https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956
    >>> [2] https://github.com/JulianFeinauer/calcite/tree/1935-match-recognize
    >>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize
    >>> 
    >>> Am 26.12.18, 08:49 schrieb "Julian Hyde" <[email protected]>:
    >>> 
    >>>  You are correct that my 1935-match-recognize branch doesn’t compile 
(as of 1a552a9). I committed and pushed in the middle of a change because I had 
done a non-trivial rebase.
    >>> 
    >>>  I haven’t missed a file; the two compilation errors were intended to 
remind me where to start work again. I am working on generating code to emit 
rows, and to populate measures and predicates from the input row. If you can 
make progress on that, that would be awesome.
    >>> 
    >>>  Are anchors (“^” and “$”) supported by Oracle? If so can you point me 
to the spec/examples. I am surprised that anything to do with patterns needs 
see beyond the boundaries of the current partition. I had assumed that each 
partition has its own state machine and it will be difficult to change that.
    >>> 
    >>>  Julian
    >>> 
    >>> 
    >>>> On Dec 25, 2018, at 2:56 PM, Julian Feinauer 
<[email protected]> wrote:
    >>>> 
    >>>> Hey,
    >>>> 
    >>>> it's once again me, JulianF.
    >>>> I started work on the Automaton / Matcher and implemented OR and 
OPTIONAL ("?") to get started with the code.
    >>>> I would highly appreciate if you (Julian H) could check this code (I 
made a PR to your branch).
    >>>> Then, what else did you consider as necessary for the implementation?
    >>>> I thought about anchors ("^", "$") but this would need a little bit of 
extra changes in the PartitionStates, as far as I see it (to check when we 
"enter" a partition and when we "leave".
    >>>> 
    >>>> Best
    >>>> JulianF
    >>>> 
    >>>> Am 25.12.18, 20:38 schrieb "Julian Feinauer" 
<[email protected]>:
    >>>> 
    >>>> Hi Julian,
    >>>> 
    >>>> as I already declared my interest in MATCH_RECOGNIZE and offered my 
help, I plan to do some things in the next one or two weeks.
    >>>> Thus, I wanted to start based on your branch (“1935-match-recognize”).
    >>>> 
    >>>> I have some problems getting it to run.
    >>>> Is it possible that there are some files missing in the commit or are 
there some things to consider?
    >>>> 
    >>>> Thanks!
    >>>> Julian (F)
    >>>> 
    >>>> On 2018/11/26 20:09:00, Julian Hyde 
<[email protected]<mailto:[email protected]>> wrote:
    >>>>> Over thanksgiving, I started working on MATCH_RECOGNIZE again. I 
wrote a standalone class called Automaton that allows you to build patterns 
(basically regular expressions, but sufficient for the PATTERN sub-clause of 
MATCH_RECOGNIZE), and execute them in a unit test.>
    >>>>> 
    >>>>> Would someone like to help me develop this? We have support for “*” 
(zero or more repeats), “+” (1 or more repeats) and “{m,n}” (bounded repeats) 
but need “|” (or) and several others. It should be fairly straightforward 
test-driven development: add tests to AutomatonTest.java [1], then change 
Automaton, AutomatonBuilder, Pattern or Matcher until they pass.>
    >>>>> 
    >>>>> We also need lots of SQL tests. Could someone write queries against 
Oracle’s “ticker” table and paste the queries and results into match.iq?>
    >>>>> 
    >>>>> See CALCITE-1935 [2], and my branch [3].>
    >>>>> 
    >>>>> I have cherry-picked commits from Zhiqiang He’s branch [4] into my 
branch, so this will be a joint effort when it is finished.>
    >>>>> 
    >>>>> Julian>
    >>>>> 
    >>>>> [1] 
https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java
 
<https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java><https://github.com/julianhyde/calcite/blob/1935-match-recognize/core/src/test/java/org/apache/calcite/runtime/AutomatonTest.java%3e>>
    >>>>> 
    >>>>> [2] https://issues.apache.org/jira/browse/CALCITE-1935 
<https://issues.apache.org/jira/browse/CALCITE-1935><https://issues.apache.org/jira/browse/CALCITE-1935%3e>>
    >>>>> 
    >>>>> [3] https://github.com/julianhyde/calcite/tree/1935-match-recognize/ 
<https://github.com/julianhyde/calcite/tree/1935-match-recognize/><https://github.com/julianhyde/calcite/tree/1935-match-recognize/%3e>>
    >>>>> 
    >>>>> [4] 
https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3 
<https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3><https://github.com/Zhiqiang-He/calcite/tree/calcite-1935-MR-Implementation3%3e>>
    >>>>> 
    >>>>> 
    >>>>>> On Nov 21, 2018, at 8:45 AM, Julian Feinauer 
<[email protected]<mailto:[email protected]>> wrote:>
    >>>>>>> 
    >>>>>> Sorry, this is an old mail which got sent accidentally again by my 
mail program.>
    >>>>>> Please ignore this and excuse this.>
    >>>>>>> 
    >>>>>> Julian>
    >>>>>>> 
    >>>>>> Am 21.11.18, 16:34 schrieb "Julian Feinauer" 
<[email protected]<mailto:[email protected]>>:>
    >>>>>>> 
    >>>>>> Hi Julian,>
    >>>>>>> 
    >>>>>> I decided to reply to this (old) email, because here some facts are 
noted.>
    >>>>>> Funnily, Apache Flink released their MATCH_RECOGNIZE Implementation 
yesterday.>
    >>>>>>> 
    >>>>>> So I recall that you and Zhigiang He did something on this.>
    >>>>>> I would like to have such a feature in Calcite (as stated in the 
other mail) and could try to go into this a bit with a colleague of mine and 
give a bit of support on this topic (In fact, it sounds like fun to us…).>
    >>>>>> Perhaps theres also the chance to learn something from Flinks 
implementation, as you already had some contacts with them, I think?>
    >>>>>>> 
    >>>>>> Best>
    >>>>>> Julian>
    >>>>>>> 
    >>>>>> On 2018/07/23 17:53:57, Julian Hyde 
<[email protected]<mailto:[email protected]>> wrote:>
    >>>>>>> For quite a while we have had partial support for MATCH_RECOGNIZE. 
We support it in the parser and validator, but there is no runtime 
implementation. It’s a shame, because MATCH_RECOGNIZE is an incredibly powerful 
SQL feature for both traditional SQL (it’s in Oracle 12c) and for continuous 
query (aka complex event processing - CEP).>>
    >>>>>>>> 
    >>>>>>> I figure it’s time to change that. My plan is to implement it 
incrementally, getting simple queries working to start with, then allow people 
to add more complex queries.>>
    >>>>>>>> 
    >>>>>>> In a dev branch [1], I’ve added a method Enumerables.match[2]. The 
idea is that if you supply an Enumerable of input data, a finite state machine 
to figure out when a sequence of rows makes a match (represented by a 
transition function: (state, row) -> state), and a function to convert a 
matched set of rows to a set of output rows. The match method is fairly 
straightforward, and I almost have it finished.>>
    >>>>>>>> 
    >>>>>>> The complexity is in generating the finite state machine, emitter 
function, and so forth.>>
    >>>>>>>> 
    >>>>>>> Can someone help me with this task? If your idea of fun is 
implementing database algorithms, this is about as much fun as it gets. You 
learned about finite state machines in college - this is your chance to 
actually write one!>>
    >>>>>>>> 
    >>>>>>> This might be a good joint project with the Flink community. I know 
Flink are thinking of implementing CEP, and the algorithm we write here could 
be shared with Flink (for use via Flink SQL or via the Flink API).>>
    >>>>>>>> 
    >>>>>>> Julian>>
    >>>>>>>> 
    >>>>>>> [1] 
https://github.com/julianhyde/calcite/commits/1935-match-recognize 
<https://github.com/julianhyde/calcite/commits/1935-match-recognize>><https://github.com/julianhyde/calcite/commits/1935-match-recognize%3e%3e>>
    >>>>>>>> 
    >>>>>>> [2] 
https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73
 
<https://github.com/julianhyde/calcite/commit/4dfaf1bbee718aa6694a8ce67d829c32d04c7e87#diff-8a97a64204db631471c563df7551f408R73>>>
    >>>>>>> 
    >>>>>>> 
    >>>>> 
    >>>>> 
    >>>> 
    >>>> 
    >>> 
    >>> 
    >>> 
    >> 
    >> 
    >> 
    > 
    > 
    >

Re: MATCH_RECOGNIZE

Reply via email to