Hi Tatsuo,

As promised, here is a summary of the open items and my proposed
plan for the remaining work.


1. FIXME bugs (to fix within the patch)

These are correctness issues that should be resolved before commit.

[FIXME] Cycle prevention insufficient

  The cycle prevention condition (count == 0 && min == 0) is
  insufficient for patterns like (A*)* where cycles occur at
  count > 0.  Currently relies on implicit duplicate detection
  in nfa_add_state_unique.  An explicit cycle detection logic
  needs to be added.

[FIXME] altPriority -- incomplete lexical ordering

  altPriority tracks only the last ALT choice, so repeated or
  nested ALTs like (A|B)+ cannot correctly implement SQL standard
  lexical ordering.  We may need to maintain an internal classifier
  that tracks the entire ALT path to correctly determine lexical
  priority across repeated alternations.  I'm considering an
  elemIdx-based classifier that records the sequence of element
  positions through ALT branches -- from elemIdx we can derive
  altPriority, varId, and variable name, so it carries more
  information than a varId-based approach, which cannot distinguish
  branches when the same variable appears in multiple ALT
  alternatives.  Since the classifier sequence grows with each
  ALT choice, state forks will need an efficient sharing structure
  (e.g., chunk tree with refcounting) to avoid memory explosion.
  The details still need refinement and verification, so it may
  take some time.


2. Test and documentation (within the patch)

Node serialization 0% coverage

  outfuncs.c/readfuncs.c/equalfuncs.c.  You suggested leaving
  this until reluctant quantifiers are implemented.  This depends
  on the scope decision in section 4 -- if reluctant quantifiers
  are included, we should address coverage then.

Test reorganization

  - Remove tests from rpr.sql that duplicate coverage in rpr_base.sql
    and rpr_nfa.sql
  - Restructure tests around SQL-centric scenarios rather than
    internal implementation details
  - Add real-world scenario tests (e.g., stock V-shape, sessionization)

Documentation review

  - Review and enhance the existing SGML documentation
  - Ensure all implemented features are accurately documented


3. Performance optimizations (separate follow-up patches)

These require thorough analysis and research, but since they are
pure performance improvements with no compatibility impact, they
can be submitted as independent patches afterward.

State pruning optimization

  Active NFA states with lower lexical priority than an already
  confirmed match candidate could be pruned early.  Correctness
  is hard to guarantee for all pattern combinations.  Documented
  as a future research item.

Anchored pattern absorption optimization

  PREFIX elements (e.g., START in "START A+ B") block absorption,
  causing O(n^2) regression for anchored patterns.  A draft design
  exists: track an alternate "shadow" path that skips PREFIX and
  starts at the BODY region, enabling absorption eligibility checks
  while the original path processes PREFIX normally.  This keeps
  concurrent contexts bounded to PREFIX_length + 1, maintaining
  O(n) complexity.  The design needs further refinement before
  implementation.


4. Unimplemented features

The following R020 features are listed in the commit message as not
yet implemented.  Could you let me know which ones you'd like to
include in this patch versus deferring to follow-up patches?

  Requires per-match state tracking infrastructure:
  - MEASURE -- new clause, pattern variable references in target list,
    and per-match aggregate evaluation within the NFA executor
  - CLASSIFIER -- requires per-row variable binding in match state;
    the altPriority fix in section 1 would provide part of this
    infrastructure

  Quantifier/pattern extensions:
  - Reluctant quantifiers (*? etc.) -- parsing exists, runtime error
  - {- -} (exclusion)
  - Empty pattern -- A{0}, () empty pattern.  Should we implement
    A{0} in this patch and defer empty match semantics (unmatch vs
    empty match distinction) to when MEASURE is implemented?
    A{0,0}: standard forbids, Oracle/Snowflake allow.
    Should we follow the standard and reject A{0,0}?
  - PERMUTE -- can be expanded to alternations at parse time

  Navigation/control:
  - SEEK
  - AFTER MATCH SKIP TO / FIRST / LAST
    (only SKIP PAST LAST ROW and SKIP TO NEXT ROW are supported)

  Row pattern functions:
  - SUBSET -- variable grouping for aggregate evaluation
  - FIRST, LAST

  Out of scope:
  - MATCH_RECOGNIZE (R010)


I'd like your thoughts on the following scope questions:
  - Reluctant quantifiers: in scope for this patch?
  - A{0}: proceed with implementation?
  - () empty pattern: also in scope, or defer?
  - A{0,0}: reject per standard, or allow like Oracle/Snowflake?

There are quite a few items listed above, and I'd like to
make sure I'm working on the right things in the right order.
If you could give me some guidance on which items to prioritize
and how you'd like to see these addressed, that would be very
helpful.

Best regards,
Henson

Reply via email to