For the purpose of full disclosure, I think Scala offers a much more efficient pattern-matching paradigm. Using nPath is like using assembler to program distributed systems. I cannot say much here today, but the pattern would look like:
  def matchSessions(h: Seq[Session[PageView]], id: String,
                    p: Seq[PageView]): Seq[Session[PageView]] = {
    p match {
      case Nil => Nil
      case PageView(ts1, "company.com>homepage") ::
           PageView(ts2, "company.com>plus>products landing") :: tail
           if ts2 > ts1 + 600 =>
        matchSessions(h, id, tail).+:(new Session(id, p))
      case _ => matchSessions(h, id, p.tail)
    }
  }

Look for Scala case statements with guards and upcoming book releases.

http://docs.scala-lang.org/tutorials/tour/pattern-matching
https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch03s14.html

On Tue, Mar 1, 2016 at 8:34 AM, Henri Dubois-Ferriere <henr...@gmail.com> wrote:

> fwiw Apache Flink just added CEP. Queries are constructed programmatically
> rather than in SQL, but the underlying functionality is similar.
>
> https://issues.apache.org/jira/browse/FLINK-3215
>
> On 1 March 2016 at 08:19, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Herman,
>>
>> Thank you for your reply!
>> This functionality usually finds its place in financial services, which
>> use CEP (complex event processing) for correlation and pattern matching.
>> Many commercial products have this, including Oracle and Teradata Aster
>> Data MR Analytics. I agree the syntax is a bit awkward, but once you
>> understand it, it is actually very compact for expressing something that
>> is very complex. Esper has this feature partially implemented (
>> http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
>> ).
>>
>> I found the Teradata Analytics documentation best at describing its usage.
>> For example (note nPath is similar to match_recognize):
>>
>> SELECT last_pageid, MAX( count_page80 )
>> FROM nPath(
>>   ON ( SELECT * FROM clicks WHERE category >= 0 )
>>   PARTITION BY sessionid
>>   ORDER BY ts
>>   PATTERN ( 'A.(B|C)*' )
>>   MODE ( OVERLAPPING )
>>   SYMBOLS ( pageid = 50 AS A,
>>             pageid = 80 AS B,
>>             pageid <> 80 AND category IN (9,10) AS C )
>>   RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
>>            COUNT ( * OF B ) AS count_page80,
>>            COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
>> )
>> WHERE count_any >= 5
>> GROUP BY last_pageid
>> ORDER BY MAX( count_page80 )
>>
>> The above means:
>> Find user click-paths starting at pageid 50 and passing exclusively
>> through either pageid 80 or pages in category 9 or category 10. Find the
>> pageid of the last page in the path and count the number of times page 80
>> was visited. Report the maximum count for each last page, and sort the
>> output by the latter. Restrict to paths containing at least 5 pages.
>> Ignore pages in the sequence with category < 0.
>>
>> If this query is written in pure SQL (if possible at all), it requires
>> several self-joins. The interesting thing about this feature is that it
>> integrates SQL + streaming + ML in one (perhaps graph too).
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
>> hvanhov...@questtec.nl> wrote:
>>
>>> Hi Jerry,
>>>
>>> This is not on any roadmap. I (briefly) browsed through this; it
>>> looks like some sort of window function with very awkward syntax. I
>>> think Spark provides better constructs for this using
>>> DataFrames/Datasets/nested data...
>>>
>>> Feel free to submit a PR.
>>>
>>> Kind regards,
>>>
>>> Herman van Hövell
>>>
>>> 2016-03-01 15:16 GMT+01:00 Jerry Lam <chiling...@gmail.com>:
>>>
>>>> Hi Spark developers,
>>>>
>>>> Will you consider adding support for implementing "Pattern matching in
>>>> sequences of rows"?
>>>> More specifically, I'm referring to this:
>>>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>>>
>>>> This is a very cool and useful feature for pattern matching over live
>>>> stream/archived data. It is sort of related to machine learning because
>>>> it is usually used in clickstream analysis or path analysis. It is also
>>>> related to streaming because of the nature of the processing (mostly
>>>> time-series data). It is SQL because there is a good way to express and
>>>> optimize the query.
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry

--
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com
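The guard-based sessionization sketched at the top of the thread can be made self-contained. This is only a minimal sketch: the `PageView` and `Session` case classes below are assumptions, since the original snippet does not define them.

```scala
// Hypothetical data types; the original snippet leaves these undefined.
case class PageView(ts: Long, path: String)
case class Session[A](id: String, views: Seq[A])

def matchSessions(h: Seq[Session[PageView]], id: String,
                  p: Seq[PageView]): Seq[Session[PageView]] = {
  p match {
    case Nil => Nil
    // Guard: the two landmark pages seen more than 600 seconds apart
    // mark the start of a session covering the remaining views.
    case PageView(ts1, "company.com>homepage") ::
         PageView(ts2, "company.com>plus>products landing") :: tail
         if ts2 > ts1 + 600 =>
      matchSessions(h, id, tail).+:(Session(id, p))
    // Otherwise slide the window forward by one view.
    case _ => matchSessions(h, id, p.tail)
  }
}

val views = Seq(
  PageView(0L, "company.com>homepage"),
  PageView(1000L, "company.com>plus>products landing"),
  PageView(1500L, "company.com>checkout"))

val sessions = matchSessions(Nil, "user-1", views)
```

With the sample data above, the guard fires on the first two views (1000 > 0 + 600), so one session for "user-1" is emitted covering all three page views.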