Hi Tatsuo, Here are two incremental patches on top of v43 + our previous two.
nocfbot-0003: Fix ALT lexical ordering via DFS early termination
The altPriority FIXME turned out to have a simple solution. The key
insight is that the NFA state list is already in lexical order from
DFS traversal, so when a state reaches FIN, all remaining states in
the list have worse lexical priority and can be pruned immediately.
This makes the altPriority field unnecessary -- DFS traversal order
itself guarantees correct lexical ordering. The patch removes
altPriority from RPRNFAState entirely and adds early termination
in nfa_advance(): when a new FIN is reached, the remaining states
are freed and processing stops.
This also implements the state pruning optimization I mentioned
earlier -- it falls out naturally from the same mechanism.
Changes:
- Remove altPriority field from RPRNFAState and all call sites
- Add early termination in nfa_advance() on new FIN arrival
- Simplify nfa_add_matched_state() to unconditional replacement
- Fix outer END count increment for quantified VARs in
nfa_advance_var()
- Remove FIXME 1 test cases, add test_alt_lexical_order test
- Add EXPLAIN ANALYZE test verifying early termination statistics
nocfbot-0004: Implement reluctant quantifiers
With the DFS early termination infrastructure from nocfbot-0003,
reluctant quantifiers become straightforward: reverse the DFS
traversal order so that shorter matches are explored first, then
apply the same early termination to stop at the shortest match.
Greedy explores enter/loop before skip/exit; reluctant reverses
this to skip/exit before enter/loop. When the preferred (shorter)
path reaches FIN, the longer path is pruned immediately.
Changes:
- Remove parser error rejecting reluctant quantifier syntax
- Extend tryUnwrapGroup() to propagate reluctant flag when
unwrapping single-child groups: (A)+? -> A+?
- Add reluctant branching in nfa_advance_begin(),
nfa_advance_end(), and nfa_advance_var() with per-branch
early termination
- Add tests in rpr_base.sql, rpr_nfa.sql, and rpr.sql covering
basic reluctant semantics, quantifier boundaries, interaction
with greedy quantifiers, and nested/alternation combinations
> I found reluctant quantifiers are useful but also I don't want to make
> patch sets far bigger. How do you estimate the difficulty and the size
> of the code for reluctant quantifiers?
You asked earlier how I estimated the difficulty of reluctant
quantifiers. It turned out the answer was subtraction, not
addition -- removing altPriority and simplifying the match logic
first made reluctant quantifiers almost trivial to add on top.
Next I plan to work on the remaining FIXME (cycle prevention for
patterns like (A*)*), A{0} implementation, and test reorganization.
Best regards,
Henson
>
nocfbot-0003-DFS-early-termination.patch
Description: Binary data
nocfbot-0004-Implement-reluctant-quantifiers.patch
Description: Binary data
