Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

Reynold Xin Tue, 01 Mar 2016 13:46:29 -0800

There are definitely pros and cons for Scala vs SQL-style CEP. Scala might
be more powerful, but the target audience is very different.


How much usage is there for a CEP-style SQL syntax in practice? I've never
seen it come up so far.



On Tue, Mar 1, 2016 at 9:35 AM, Alex Kozlov <ale...@gmail.com> wrote:

> Looked at the paper: while we can argue about the performance side, I think
> semantically the Scala pattern matching is much more expressive.  Time
> will tell.
>
> On Tue, Mar 1, 2016 at 9:07 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Alex,
>>
>> We went down this path already :) This is the reason we are trying other
>> approaches. The recursion makes it very inefficient for some cases.
>> For details, this paper describes it very well:
>> https://people.cs.umass.edu/%7Eyanlei/publications/sase-sigmod08.pdf
>> which is the same paper referenced in the Flink ticket.
>>
>> Please let me know if I have overlooked something. Thank you for sharing this!
>>
>> Best Regards,
>>
>> Jerry
>>
>> On Tue, Mar 1, 2016 at 11:58 AM, Alex Kozlov <ale...@gmail.com> wrote:
>>
>>> For the purpose of full disclosure, I think Scala offers a much more
>>> efficient pattern matching paradigm.  Using nPath is like using assembler
>>> to program distributed systems.  I cannot say much here today, but the
>>> pattern would look like:
>>>
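>>>     // Session and PageView are not defined in this message.
>>>     def matchSessions(h: Seq[Session[PageView]], id: String,
>>>                       p: Seq[PageView]): Seq[Session[PageView]] = {
>>>       p match {
>>>         // No page views left: nothing more to match.
>>>         case Nil => Nil
>>>         // Homepage followed by the products landing page, with the
>>>         // second timestamp more than 600 after the first.
>>>         case PageView(ts1, "company.com>homepage") ::
>>>              PageView(ts2, "company.com>plus>products landing") ::
>>>              tail if ts2 > ts1 + 600 =>
>>>           matchSessions(h, id, tail).+:(new Session(id, p))
>>>         // Otherwise drop the first page view and retry.
>>>         case _ => matchSessions(h, id, p.tail)
>>>       }
>>>     }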
>>>
>>> Look for Scala case statements with guards and upcoming book releases.
>>>
>>> http://docs.scala-lang.org/tutorials/tour/pattern-matching
>>>
>>> https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch03s14.html
>>>
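
A minimal, self-contained sketch of the guard-based matching quoted above, for anyone who wants something that compiles on its own. `PageView`, `Session`, and the `SessionMatcher` object are assumptions (the thread does not define them), the unused `h` parameter is dropped, and the input is a `List` because the `::` pattern only matches lists.

```scala
// Hypothetical types: the thread does not show the real definitions.
final case class PageView(ts: Long, path: String)
final case class Session[A](id: String, events: Seq[A])

object SessionMatcher {
  // Returns one Session for each place where the homepage is immediately
  // followed by the products landing page with ts2 > ts1 + 600.
  // Plain recursion, not tail-recursive: fine for a sketch, not for long inputs.
  def matchSessions(id: String, views: List[PageView]): List[Session[PageView]] =
    views match {
      case Nil => Nil
      case PageView(ts1, "company.com>homepage") ::
           PageView(ts2, "company.com>plus>products landing") ::
           tail if ts2 > ts1 + 600 =>
        // Keep the remaining views from the match onward, as the original does.
        Session(id, views) :: matchSessions(id, tail)
      case _ :: rest =>
        // No match at this position: slide forward by one page view.
        matchSessions(id, rest)
    }
}
```

Calling `SessionMatcher.matchSessions("u1", views)` on a list of page views ordered by timestamp yields one `Session` per matched pair; the time unit of `ts` is left open, as in the original message.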
>>> On Tue, Mar 1, 2016 at 8:34 AM, Henri Dubois-Ferriere <henr...@gmail.com> wrote:
>>>
>>>> fwiw Apache Flink just added CEP. Queries are constructed
>>>> programmatically rather than in SQL, but the underlying functionality is
>>>> similar.
>>>>
>>>> https://issues.apache.org/jira/browse/FLINK-3215
>>>>
>>>> On 1 March 2016 at 08:19, Jerry Lam <chiling...@gmail.com> wrote:
>>>>
>>>>> Hi Herman,
>>>>>
>>>>> Thank you for your reply!
>>>>> This functionality usually finds its place in financial services, which
>>>>> use CEP (complex event processing) for correlation and pattern matching.
>>>>> Many commercial products have this, including Oracle and Teradata Aster
>>>>> Data MR Analytics. I do agree the syntax is a bit awkward, but once you
>>>>> understand it, it is actually very compact for expressing something that
>>>>> is very complex. Esper has this feature partially implemented (
>>>>> http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
>>>>> ).
>>>>>
>>>>> I found that the Teradata Analytics documentation describes its usage
>>>>> best. For example (note that nPath is similar to match_recognize):
>>>>>
>>>>> SELECT last_pageid, MAX( count_page80 )
>>>>>  FROM nPath(
>>>>>  ON ( SELECT * FROM clicks WHERE category >= 0 )
>>>>>  PARTITION BY sessionid
>>>>>  ORDER BY ts
>>>>>  PATTERN ( 'A.(B|C)*' )
>>>>>  MODE ( OVERLAPPING )
>>>>>  SYMBOLS ( pageid = 50 AS A,
>>>>>            pageid = 80 AS B,
>>>>>            pageid <> 80 AND category IN (9,10) AS C )
>>>>>  RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
>>>>>           COUNT ( * OF B ) AS count_page80,
>>>>>           COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
>>>>>  )
>>>>>  WHERE count_any >= 5
>>>>>  GROUP BY last_pageid
>>>>>  ORDER BY MAX( count_page80 )
>>>>>
>>>>> The above means:
>>>>> Find user click-paths starting at pageid 50 and passing exclusively
>>>>> through either pageid 80 or pages in category 9 or category 10. Find the
>>>>> pageid of the last page in the path and count the number of times page 80
>>>>> was visited. Report the maximum count for each last page, and sort the
>>>>> output by the latter. Restrict to paths containing at least 5 pages.
>>>>> Ignore pages in the sequence with category < 0.
>>>>>
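
For comparison, the plain-English description above can be sketched over in-memory Scala collections. This is only an illustration under stated assumptions: the `Click` case class and the `ClickPaths` object are hypothetical, each path is extended greedily from every row matching `A`, and nPath's exact `OVERLAPPING` semantics are not claimed to be reproduced.

```scala
// Hypothetical row type mirroring the columns used in the query.
final case class Click(sessionId: String, ts: Long, pageId: Int, category: Int)

object ClickPaths {
  // The three SYMBOLS of the query.
  private def isA(c: Click): Boolean = c.pageId == 50
  private def isB(c: Click): Boolean = c.pageId == 80
  private def isC(c: Click): Boolean =
    c.pageId != 80 && Set(9, 10).contains(c.category)

  // Paths within one session: start at each A row and extend over B|C rows.
  // Assumption: greedy extension, one path per starting row.
  def paths(session: Seq[Click]): List[Seq[Click]] = {
    val rows = session.filter(_.category >= 0).sortBy(_.ts)
    rows.tails.collect {
      case start +: rest if isA(start) =>
        start +: rest.takeWhile(c => isB(c) || isC(c))
    }.toList
  }

  // Last page id of each path -> maximum number of page-80 visits,
  // restricted to paths with at least minLength rows.
  def maxPage80ByLastPage(clicks: Seq[Click], minLength: Int = 5): Map[Int, Int] =
    clicks.groupBy(_.sessionId).values.toList
      .flatMap(session => paths(session))
      .filter(_.size >= minLength)
      .groupBy(_.last.pageId)
      .map { case (lastPage, ps) => lastPage -> ps.map(_.count(isB)).max }
}
```

The final `ORDER BY` is left out; sorting the resulting map's entries by value would cover it. The sketch works on a single machine, whereas the query's `PARTITION BY sessionid` is what lets an engine distribute the same work.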
>>>>> If this query were written in pure SQL (if possible at all), it would
>>>>> require several self-joins. The interesting thing about this feature is
>>>>> that it integrates SQL+Streaming+ML in one (and potentially graph too).
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Jerry
>>>>>
>>>>>
>>>>> On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
>>>>> hvanhov...@questtec.nl> wrote:
>>>>>
>>>>>> Hi Jerry,
>>>>>>
>>>>>> This is not on any roadmap. I (briefly) browsed through this, and
>>>>>> it looks like some sort of window function with very awkward syntax.
>>>>>> I think Spark provides better constructs for this using
>>>>>> dataframes/datasets/nested data...
>>>>>>
>>>>>> Feel free to submit a PR.
>>>>>>
>>>>>> Kind regards,
>>>>>>
>>>>>> Herman van Hövell
>>>>>>
>>>>>> 2016-03-01 15:16 GMT+01:00 Jerry Lam <chiling...@gmail.com>:
>>>>>>
>>>>>>> Hi Spark developers,
>>>>>>>
>>>>>>> Would you consider adding support for "Pattern matching
>>>>>>> in sequences of rows"? More specifically, I'm referring to this:
>>>>>>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>>>>>>
>>>>>>> This is a very cool/useful feature for pattern matching over live
>>>>>>> stream/archived data. It is sort of related to machine learning
>>>>>>> because it is usually used in clickstream analysis or path analysis.
>>>>>>> It is also related to streaming because of the nature of the processing
>>>>>>> (mostly time series data). It is SQL because there is a good way to
>>>>>>> express and optimize the query.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> Jerry
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Alex Kozlov
>>> (408) 507-4987
>>> (650) 887-2135 efax
>>> ale...@gmail.com
>>>
>>
>>
>
>
> --
> Alex Kozlov
> (408) 507-4987
> (650) 887-2135 efax
> ale...@gmail.com
>
