> Maybe I underestimated the utility of ^ and $. The definition seems
> intricate. I thought about adding a combinator for matching newline but
> now think that would lead to wrong start and end positions. For example
> the start position of the matching substring for ^a in "a\na" should
> be 2 not 1, right? Or is it 0 although there is no newline at the
> beginning?

The first "a" would match with indexes (0,1) and the second "a" would
match with indexes (2,3), since the newline sits at index 1.

>
> Is there a page with examples that show how ^ and $ should behave exactly?
>

Without REG_NEWLINE the meanings are:

  .  matches any single character (though note that handling of a zero
     byte is impossible for C-style strings for a different reason).
  ^  is an assertion that, instead of being AlwaysTrue (eps) or
     AlwaysFalse (noMatch), is true before any characters have been
     accepted and false afterward.
  $  is an assertion that is true only when there are no more characters
     to match and false before this.

With REG_NEWLINE the meanings are:

  .  matches any single character EXCEPT '\n' newline (ASCII 10).
  ^  is true before any characters have been matched and true right
     after a newline has been matched, else false.
  $  is true when there are no more characters to match and true if the
     next character to match is a newline, else false.

Let 'a' and 'b' and 'c' be some complicated regular expressions that
cannot accept a newline with REG_NEWLINE enabled:

  ^$   finds blank lines: the indexes between newlines, or between a
       newline and the start or end of the text.
  ^a$  requires 'a' to exactly fill a line, and the captured string has
       no newlines.

A more complicated use, perhaps as part of a crazy parser:
"(a(\n)?)(^|b)(c|$)" has 'a' match some text and perhaps the newline.
If the newline was there then the ^ matches and b might be skipped,
otherwise b must be used. The match ends with '(c|$)', which therefore
either starts the new line or follows b. And (c|$) can avoid matching
'c' if the next character is a newline.

Note that the regular expression "(^|[aA])" has a non-trivial
"can_accept_empty" property: it can sometimes accept empty.

And if you are recording parenthetical captures then "(^)?" is subtle.
When ^ is true the (^) succeeds like () and when it is false it does
not. This inserts a test into the pattern that can be checked later.

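To make the indexes concrete, here is a minimal sketch in Python. Note the assumptions: Python's re module is a backtracking engine, not a POSIX one, and I am using re.MULTILINE only as a stand-in for the effect REG_NEWLINE has on ^ and $. The helper names (spans, caret_capture) are made up for this illustration and are not part of regex-tdfa or any POSIX API.

```python
import re

TEXT = "a\na"  # three characters: 'a' at 0, newline at 1, 'a' at 2


def spans(pattern, text, flags=0):
    """Return the (start, end) indexes of every non-overlapping match."""
    return [m.span() for m in re.finditer(pattern, text, flags)]


# Without multiline mode, ^ is only true before any character is consumed.
single_line = spans(r"^a", TEXT)

# With multiline mode (assumed here to mirror REG_NEWLINE for anchors),
# ^ is also true right after a newline.
multi_line = spans(r"^a", TEXT, re.MULTILINE)

# ^$ finds the empty line between the two newlines in "a\n\nb".
blank_lines = spans(r"^$", "a\n\nb", re.MULTILINE)


def caret_capture(text):
    """Spans of group 1 in "(^)?a"; (-1, -1) means (^) took no part."""
    return [m.span(1) for m in re.finditer(r"(^)?a", text, re.MULTILINE)]


if __name__ == "__main__":
    print(single_line)             # [(0, 1)]
    print(multi_line)              # [(0, 1), (2, 3)]
    print(blank_lines)             # [(2, 2)]
    print(caret_capture("a\nba"))  # [(0, 0), (-1, -1)]
```

In the last call the first "a" starts the text, so (^) is captured as the empty span (0,0); the second "a" follows "b", so (^) is skipped and reports (-1,-1). That recorded capture is the test that can be checked later.
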
And "((^$)|(^)|($))" is worse: it does not always succeed, and which
sub-pattern gets captured depends on the presence of one or two newlines.

In "((^)|(^$))" it is impossible for (^$) to be used since the first (^)
will always be favored by the POSIX rules. Similarly "(()|(^))" will
never use (^).

A small chunk of regex-tdfa sifts through the possible ways to accept 0
characters for each node in the parse tree, keeps an ordered list of
sets of assertions to check, and cleans out those that are logically
excluded.

Slightly more useful anchors are added in Perl/PCRE:

> ANCHORS AND SIMPLE ASSERTIONS
>   \b   word boundary
>   \B   not a word boundary
>   ^    start of subject
>          also after internal newline in multiline mode
>   \A   start of subject
>   $    end of subject
>          also before newline at end of subject
>          also before internal newline in multiline mode
>   \Z   end of subject
>          also before newline at end of subject
>   \z   end of subject
>   \G   first matching position in subject

I added \b \B as above, and added \` \' to be like \A and \Z above, and
added \< and \> to be beginning and end of word assertions.

With enough assertions and negated assertions one could level up to
using a binary decision diagram to express when a sub-pattern can accept
0 characters.

Ville's libtre gets this wrong:

> Searched text: "searchme"
> Regex pattern: "((s^)|(s)|(^)|($)|(^.))*"
> Expected output: "(0,1)(0,1)(-1,-1)(0,1)(-1,-1)(-1,-1)(-1,-1)"
> Actual result  : "(0,1)(0,1)(-1,-1)(0,1)(1,1)(1,1)(-1,-1)"

And sometimes very wrong:

> Searched text: "searchme"
> Regex pattern: "s(^|())e"
> Expected output: "(0,2)(1,1)(1,1)"
> Actual result  : "NOMATCH"

Cheers,
Dr. Chris Kuklewicz
