Hi hackers,

I ran a cross-validation test comparing PostgreSQL RPR (current patch) and
Trino MATCH_RECOGNIZE to verify both result correctness and performance
characteristics across different scales.

The dataset consists of two segments where rows are arranged sequentially
as:

A → B → C → D

Each segment contains a fixed distribution of categories A/B/C/D. Category
E does not exist in the dataset and is used to test match failure behavior.

Test scales ranged from 20,000 to 100,000 rows.

All tests were executed using:

AFTER MATCH SKIP PAST LAST ROW

Test Environment

Item        | PostgreSQL              | Trino
------------+-------------------------+-----------------------
Version     | 19devel (RPR patch)    | 471
Runtime     | Local                  | Local Docker container
Platform    | Linux x86_64           | Linux x86_64
RPR syntax  | WINDOW clause          | MATCH_RECOGNIZE

Dataset Structure

Each segment contains sequential categories:

A : ~1/3 of rows
B : ~1/3 of rows
C : ~1/3 of rows
D : 1 row

Example at 1x scale:

Segment size : 10,000 rows
Total rows : 20,000

Test Cases

Test 1
Pattern:
PATTERN (A+ B+ C+ D)

Expected result:
1 match per segment (2 total)

Reason:
Data is arranged exactly as A → B → C → D.

Test 2
Pattern:
PATTERN (A+ B+ C+ E)

Expected result:
0 rows (no match)

Reason:
Category E does not exist in the dataset.

Correctness Results

Across all scales both systems returned identical results.

Test 1 → 1 match per segment (2 total)
Test 2 → 0 rows

Performance Results

Scale | Total Rows | PG Test1 | PG Test2 | Trino Test1 | Trino Test2
------+------------+----------+----------+-------------+-------------
1x    |  20,000    | 19 ms    | 17 ms    | ~0.3 s      | ~358 s
2x    |  40,000    | 37 ms    | 37 ms    | ~0.8 s      | ~1,364 s
3x    |  60,000    | 57 ms    | 51 ms    | ~1.5 s      | ~4,424 s
4x    |  80,000    | 73 ms    | 68 ms    | ~2.3 s      | ~9,989 s
5x    | 100,000    | 99 ms    | 92 ms    | ~3.3 s      | ~20,014 s

Note:
Trino measurements are adjusted to remove JVM startup overhead (~2.4s
measured via SELECT 1).

Observations

Match success (Test 1)

Both systems scale approximately linearly because the entire segment is
consumed in a single pass after a successful match.

PostgreSQL shows lower absolute latency.

Match failure (Test 2)

PostgreSQL maintains near-linear scaling.

This appears to be due to its NFA implementation combined with Context
Absorption, which discards redundant matching contexts when they reach the
same NFA state but start later in the input.

Example:

Ctx1 start=0 length=2
Ctx2 start=1 length=1

Ctx1 subsumes Ctx2, so Ctx2 is discarded.

This keeps the number of active contexts small and prevents quadratic
growth.

Measured scaling:

17 ms → 92 ms (1x → 5x)

which is consistent with O(n).

Trino also uses an NFA approach but appears to lack Context Absorption or a
similar optimization.

As a result, it explores matching contexts from all possible start
positions, leading to rapidly increasing backtracking cost.

Measured scaling:

358 s → 20,014 s (1x → 5x)

This exceeds theoretical O(n²) growth and appears closer to O(n²)–O(n³) in
practice, likely compounded by JVM memory and GC overhead.

Summary

Correctness

PostgreSQL RPR and Trino MATCH_RECOGNIZE produce identical matching results
across all tested scales.

Performance

Match success:
Both systems scale roughly linearly.

Match failure:
PostgreSQL maintains O(n) scaling while Trino shows O(n²) or worse behavior.

At the largest tested scale:

PostgreSQL : ~92 ms
Trino : ~20,014 s (≈ 5.6 hours)

PostgreSQL is therefore approximately 217,000× faster in this scenario.

Conclusion

Both systems use NFA-based pattern matching.

The key difference appears to be Context Absorption in PostgreSQL, which
removes redundant matching contexts and guarantees linear scaling even when
patterns fail to match.

This optimization prevents the non-linear performance degradation typically
associated with row pattern recognition.

Best regards

SungJun

2026년 3월 9일 (월) PM 2:22, Tatsuo Ishii <[email protected]>님이 작성:

> Hi Henson,
>
> >> Excellnt findings!  BTW, I realized that we cannot use $1 of function
> >> in PATTERN clause like: A{$1}.
> >>
> >> ERROR:  42601: syntax error at or near "$1"
> >> LINE 10:         PATTERN (A{$1})
> >>                             ^
> >> LOCATION:  scanner_yyerror, scan.l:1211
> >>
> >> Should we document somewhere?
> >>
> >
> > The PATTERN quantifier {n} only accepts Iconst (integer literal) in the
> > grammar.  When a host variable or function parameter is used (e.g.,
> > A{$1}), the user gets a generic syntax error.
>
> Ok.
>
> > Oracle accepts broader syntax and validates later, producing an error
> > at a later stage rather than a syntax error at parse time.
> >
> > PostgreSQL itself already has precedent for this pattern -- in fact,
> > within the same window clause, frame offset (ROWS/RANGE/GROUPS) accepts
> > a_expr in the grammar and then rejects variables in parse analysis via
> > transformFrameOffset() -> checkExprIsVarFree().
> >
> > I'd lean against documenting this.  The SQL standard already defines
> > the quantifier bound as <unsigned integer literal>, so there is nothing
> > beyond the standard to call out, and documenting what is *not* allowed
> > tends to raise questions that wouldn't otherwise occur to users.
> >
> > Rather, I think accepting a broader grammar and validating later would
> > be the more appropriate response, producing a descriptive error like:
> >
> >   "argument of bounded quantifier must be an integer literal"
> >
> > I can either include this in the current patch set or handle it as a
> > separate follow-up after the main series is committed.  What do you
> > think?
>
> I think handing it as a separate follow-up after the commit is enough
> unless other developers complain.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>

Reply via email to