Hi hackers,
I'd like to propose an extension to the Row Pattern Recognition (RPR)
absorption mechanism to handle anchored patterns (those starting with
START or similar prefix elements).
== Background ==
The current RPR implementation includes an absorption optimization that
reduces time complexity from O(n²) to O(n) for unbounded quantifiers
like A+. When multiple match contexts are at the same ABSORBABLE element,
the context that started earlier (having matched more rows) absorbs
later-starting contexts, eliminating redundant processing.
For example, with pattern "A+ B":
- Ctx1 starts at row 0, matching A
- Ctx2 starts at row 1, matching A
- Ctx1 always has more A's → Ctx2 is absorbed (removed)
- Result: Only 1 context maintained, O(n) complexity
== The Problem ==
This absorption mechanism breaks down for anchored patterns like
"START SECOND A+ B". The PREFIX elements (START, SECOND) block
absorption because contexts at different PREFIX positions cannot
be meaningfully compared.
Without absorption, we regress to O(n²) behavior as contexts
accumulate without being eliminated.
== Proposed Solution: Original/Alternate Simultaneous Progression ==
The key insight is to track two logical "paths" within each context:
1. Original: Starts at Element 0 (PREFIX + BODY)
- can_absorb = true
- Can be used as matchedState (actual match result)
2. Alternate: Starts at Element 2 (BODY only, skipping PREFIX)
- can_be_absorbed = true
- Used only for absorption eligibility checking
- Cannot be used as matchedState
Both paths process the same row data simultaneously. The alternate
path serves as a "shadow" that indicates whether the context is
eligible for absorption.
== Absorption Conditions ==
For Ctx1 to absorb Ctx2, all conditions must be met:
1. Ctx1.can_absorb = true (original not outside pattern region)
2. Ctx2.can_be_absorbed = true (alternate still alive in BODY)
3. Ctx2's original has reached ABSORBABLE (PREFIX matching confirmed)
4. Ctx1 started before Ctx2
Note: No count comparison is needed. The alternate being alive
confirms absorption eligibility; the earlier-starting context
always wins.
== Why Wait for Ctx2's Original to Reach ABSORBABLE? ==
We cannot absorb Ctx2 immediately when it starts. We must wait
until Ctx2's original successfully passes through PREFIX elements
(START, SECOND) and reaches the ABSORBABLE point (A+). This
confirms that Ctx2's PREFIX matching succeeded before we remove it.
== Why Absorption is Safe (Common Fate) ==
Once both contexts reach the same ABSORBABLE element (A+), they
share a "common fate":
- Same element position, same future data stream
- Same choices ahead (continue A+ or transition to B)
- If Ctx1 succeeds → Ctx2 would have succeeded too (but shorter match)
- If Ctx1 fails → Ctx2 would have failed too (nothing lost)
- The only difference is in the past (rows already matched),
which doesn't affect future matching decisions
This property guarantees that absorption is safe: we never lose
a potential match by absorbing a later-starting context into an
earlier one.
== Flag Design ==
Compile-time (Element flags):
- RPR_ELEM_ABSORBABLE: Unchanged (marks WHERE comparison happens)
- RPR_ELEM_ABSORBABLE_BRANCH_PREFIX: New (PREFIX region, blocks
absorption)
- RPR_ELEM_ABSORBABLE_BRANCH_BODY: New (BODY region, allows absorption)
Runtime (State variables):
- can_absorb: "Can absorb others" (original=true, alternate=false)
- can_be_absorbed: "Can be absorbed" (PREFIX→false, BODY→true)
Both flags are monotonic: once false, they cannot become true again.
== Complexity Analysis ==
With this extension:
- Maximum concurrent contexts: PREFIX_length + 1
- For "START SECOND A+ B": max 3 contexts at any time
- Overall complexity: O(n) maintained
Example flow for "START SECOND A+ B" with data r0-r4(A,A,A,A,B):
Row 0: Ctx1 starts [1]
Row 1: Ctx2 starts (cannot absorb: PREFIX not confirmed) [2]
Row 2: Ctx3 starts [3]
Row 3: Ctx4 starts, Ctx2 absorbed (PREFIX confirmed) [3]
Row 4: Ctx1 matches B, Ctx3-5 discarded [1]
Note: This is not an immediate priority. Stabilizing the current
RPR patch comes first. I'm sharing this design early to get feedback
and ensure the approach is sound before implementation.
I have a working design document with detailed state transition
rules and examples. Happy to share more details.
Thoughts?
--
Best regards,
Henson