On 6/25/23 14:05, Tatsuo Ishii wrote:
Attached is a PoC patch to implement "Row pattern recognition" (RPR)
in SQL:2016 (I know SQL:2023 is already out, but I don't have access
to it). Actually SQL:2016 defines RPR in two places[1]:

     Feature R010, “Row pattern recognition: FROM clause”
     Feature R020, “Row pattern recognition: WINDOW clause”

The patch includes partial support for R020 part.

I have been dreaming of and lobbying for someone to pick up this feature. I will be sure to review it from a standards perspective and will try my best to help with the technical aspect, but I am not sure to have the qualifications for that.

THANK YOU!

> (I know SQL:2023 is already out, but I don't have access to it)

If you can, try to get ISO/IEC 19075-5 which is a guide to RPR instead of just its technical specification.

https://www.iso.org/standard/78936.html

- What is RPR?

RPR provides a way to search series of data using regular expression
patterns. Suppose you have a stock database.

CREATE TABLE stock (
        company TEXT,
        tdate DATE,
        price BIGINT);

You want to find a "V-shaped" pattern: i.e. price goes down for 1 or
more days, then goes up for 1 or more days. If we try to implement the
query in PostgreSQL, it could be quite complex and inefficient.

RPR provides convenient way to implement the query.

SELECT company, tdate, price, rpr(price) OVER w FROM stock
  WINDOW w AS (
  PARTITION BY company
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  PATTERN (START DOWN+ UP+)
  DEFINE
   START AS TRUE,
   UP AS price > PREV(price),
   DOWN AS price < PREV(price)
);

"PATTERN" and "DEFINE" are the key clauses of RPR. DEFINE defines 3
"row pattern variables" namely START, UP and DOWN. They are associated
with logical expressions namely "TRUE", "price > PREV(price)", and
"price < PREV(price)". Note that "PREV" function returns price column
in the previous row. So, UP is true if price is higher than previous
day. On the other hand, DOWN is true if price is lower than previous
day.  PATTERN uses the row pattern variables to create a necessary
pattern.  In this case, the first row is always match because START is
always true, and second or more rows match with "UP" ('+' is a regular
expression representing one or more), and subsequent rows match with
"DOWN".

Here is the sample output.

  company  |   tdate    | price | rpr
----------+------------+-------+------
  company1 | 2023-07-01 |   100 |
  company1 | 2023-07-02 |   200 |  200 -- 200->150->140->150
  company1 | 2023-07-03 |   150 |  150 -- 150->140->150
  company1 | 2023-07-04 |   140 |
  company1 | 2023-07-05 |   150 |  150 -- 150->90->110->130
  company1 | 2023-07-06 |    90 |
  company1 | 2023-07-07 |   110 |
  company1 | 2023-07-08 |   130 |
  company1 | 2023-07-09 |   120 |
  company1 | 2023-07-10 |   130 |

rpr shows the first row if all the patterns are satisfied. In the
example above 200, 150, 150 are the cases.  Other rows are shown as
NULL. For example, on 2023-07-02 price starts with 200, then goes down
to 150 then 140 but goes up 150 next day.

I don't understand this. RPR in a window specification limits the window to the matched rows, so this looks like your rpr() function is just the regular first_value() window function that we already have?

As far as I know, only Oracle implements RPR (only R010. R020 is not
implemented) among OSS/commercial RDBMSs. There are a few DWH software
having RPR. According to [2] they are Snowflake and MS Stream
Analytics. It seems Trino is another one[3].

I thought DuckDB had it already, but it looks like I was wrong.

- Note about the patch

The current implementation is not only a subset of the standard, but
is different from it in some places. The example above is written as
follows according to the standard:

SELECT company, tdate, startprice OVER w FROM stock
  WINDOW w AS (
  PARTITION BY company
  MEASURES
   START.price AS startprice
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  PATTERN (START DOWN+ UP+)
  DEFINE
   START AS TRUE,
   UP AS UP.price > PREV(UP.price),
   DOWN AS DOWN.price < PREV(DOWN.price)
);

Notice that rpr(price) is written as START.price and startprice in the
standard. MEASURES defines variable names used in the target list used
with "OVER window". As OVER only allows functions in PostgreSQL, I had
to make up a window function "rpr" which performs the row pattern
recognition task.  I was not able to find a way to implement
expressions like START.price (START is not a table alias). Any
suggestion is greatly appreciated.

As in your example, you cannot have START.price outside of the window specification; it can only go in the MEASURES clause. Only startprice is allowed outside and it gets its qualification from the OVER. Using w.startprice might have been better but that would require window names to be in the same namespace as range tables.

This currently works in Postgres:

  SELECT RANK() OVER w
  FROM (VALUES (1)) AS w (x)
  WINDOW w AS (ORDER BY w.x);

The differences from the standard include:

o MEASURES is not supported
> o FIRST, LAST, CLASSIFIER are not supported
> o PREV/NEXT in the standard accept more complex arguments
> o INITIAL or SEEK are not supported ((behaves as if INITIAL is specified)

Okay, for now.

o SUBSET is not supported

Is this because you haven't done it yet, or because you ran into problems trying to do it?

o Regular expressions other than "+" are not supported

This is what I had a hard time imagining how to do while thinking about it. The grammar is so different here and we allow many more operators (like "?" which is also the standard parameter symbol). People more expert than me will have to help here.

o Only AFTER MATCH SKIP TO NEXT ROW is supported (if AFTER MATCH is
   not specified, AFTER MATCH SKIP TO NEXT ROW is assumed. In the
   standard AFTER MATCH SKIP PAST LAST ROW is assumed in this case). I
   have a plan to implement AFTER MATCH SKIP PAST LAST ROW though.

In this case, we should require the user to specify AFTER MATCH SKIP TO NEXT ROW so that behavior doesn't change when we implement the standard default. (Your patch might do this already.)

o Aggregate functions associated with window clause using RPR do not respect RPR

I do not understand what this means.

It seems RPR in the standard is quite complex. I think we can start
with a small subset of RPR then we could gradually enhance the
implementation.

I have no problem with that as long as we don't paint ourselves into a corner.

Comments and suggestions are welcome.

I have not looked at the patch yet, but is the reason for doing R020 before R010 because you haven't done the MEASURES clause yet?

In any case, I will be watching this with a close eye, and I am eager to help in any way I can.
--
Vik Fearing



Reply via email to