Hello folks, I am in the process of accelerating down the rabbit hole of regex internals. Something that came up during my reading, is that a POSIX compliant regex engine ought to always prefer the longest possible match, when multiple matches are possible beginning from the same location in the string. [1]
I wasn't sure that that was how our regex engine worked, and indeed, on checking the manual [2] I found that our regex engine uses a strange sort of "inductive greediness" to determine whether the longest or the shortest possible match ought to be preferred. The greediness of individual particles in the regex are taken into account, and at the top level the entire expression is concluded to be either greedy, or non-greedy. I'll admit that this is a pretty obscure point, but we do appear to be in direct violation of POSIX here. I am getting the impression that our engine works this way to route around some of the performance issues that can arise in trying out every possible match with an NFA-style engine. I find it a bit unfortunate that POSIX is so unambiguous about this, while our engine's treatment is, well ... quite arcane by comparison. At minimum, I think we should be more explicit in the manual that this behaviour flips POSIX the proverbial bird. Several paragraphs south, there is a sentence reading "Incompatibilities of note include ... the longest/shortest-match (rather than first-match) matching semantics", but in the context it seems as though the author is talking about an incompatibility with Perl, not with POSIX. Thoughts? Cheers, BJ [1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html [2] http://www.postgresql.org/docs/9.0/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers