http://bugs.exim.org/show_bug.cgi?id=1315

--- Comment #6 from Philip Hazel <[email protected]> 2012-11-09 12:16:22 ---
Christoph,

Perl-style regular expressions are quite complicated, and some of the
ways in which they work are not always intuitive. If you have not
already read it, I highly recommend Jeffrey Friedl's book "Mastering
Regular Expressions" (3rd edition), published by O'Reilly. It discusses
many variants, but in particular includes Perl and PCRE.

On Thu, 8 Nov 2012, Christoph Anton Mitterer wrote:

> > e.g: /a$[^x]b/m matches to a\nb, since $ itself just checks a condition, but
> > does not change the character position.
> Uhm... not yet sure whether I understand...

"a" matches "a". Then "$" is true, because we are now just before a
newline in multiline mode (the /m modifier says "multiline"). Then [^x]
matches \n because \n is a character that is not x.

> So ok... that means basically a pattern like \r[^\n] needs at least two
> characters to match.

Yes. A class [...] always matches exactly one character.

> Am I right then, that \r\n is _not_ matched here, because \n doesn't appear
> at all as a character (because it's considered not a character in that sense
> but the "mark up" for line separating)?

No, you are not right; \n is a character. The reason that \r\n does not
match \r[^\n] is that [^\n] matches any character that is not \n. Using
\r and \n is perhaps confusing; this works no differently from the
pattern "a[^b]", which matches "a", followed by any character that is
not "b".

> Ah.... ok so [...] always means there needs to be a character...

Yes!

> and if I put in ^$ it just says "at that position, there must be a
> character, but the end-of-line condition must NOT be met".

No! Because $ is _not_ a metacharacter when it is part of a character
class [...]. It is just an ordinary character there, so [^$] matches any
character that is not a dollar. With respect, you really do need to read
that book I recommended above.
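
The points above can be checked with a short sketch. It uses Python's re
module rather than PCRE itself; that substitution is an assumption made
for convenience, on the basis that both engines treat $, /m (re.MULTILINE
in Python) and character classes the same way for these particular
patterns. The helper name matches() is made up for this illustration.

```python
import re

def matches(pattern, subject, flags=0):
    """Return True if pattern matches anywhere in subject."""
    return re.search(pattern, subject, flags) is not None

# /a$[^x]b/m against "a\nb": $ is a zero-width assertion that succeeds
# just before the newline, so [^x] is the item that consumes the newline.
multiline_match = re.search(r'a$[^x]b', 'a\nb', re.MULTILINE)

# A class always consumes exactly one character, so \r[^\n] needs a
# second character after the CR, and that character must not be LF.
cr_then_other = matches(r'\r[^\n]', '\rx')    # CR followed by "x"
cr_then_lf = matches(r'\r[^\n]', '\r\n')      # CR followed by LF
cr_alone = matches(r'\r[^\n]', '\r')          # CR with nothing after it

# Inside a class, $ is an ordinary character: [^$] means "any one
# character that is not a dollar sign".
a_then_b = matches(r'^a[^$]', 'ab')
a_alone = matches(r'^a[^$]', 'a')
a_then_dollar = matches(r'^a[^$]', 'a$')

if __name__ == '__main__':
    print(multiline_match is not None)            # /a$[^x]b/m vs "a\nb"
    print(cr_then_other, cr_then_lf, cr_alone)    # \r[^\n] cases
    print(a_then_b, a_alone, a_then_dollar)       # ^a[^$] cases
```

Only the first case in each group matches: the class has to find one real
character, and it must be a character outside the listed set.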

> a) It's still not clear why a plain $ doesn't match... I would expect it to
> _always_ match... as I would for ^
>
> b) The case:
> $ hd $file
> 00000000 41 0d 0a |A..|
> 00000003
> $ pcregrep '\n' $file ; echo $?
> 1
> Is that, because in UNIX (or rather when the end-of-line is set to \n... \n
> will never match, because again, the \n is not considered a char but rather
> the condition "end-of-line".

The reason this does not match is the way pcregrep works, which is the
same as the way GNU grep works. Basically, it is a line-based matching
process, and in effect, terminal newline characters are stripped from
each line before matching happens. So searching for a newline character
can never work. However, pcregrep does have a -M (multiline) option,
which then makes this work. If you use pcretest instead of pcregrep,
where there is better control, this pattern also matches.

> This goes now rather towards support and less towards (invalid) bug
> reporting: Is there a way in PCRE to do what I wanted... e.g. matching
> a CR, that is not followed by an LF?

This pattern does that:

    \r(?!\n)

It does precisely what you ask: first find \r, then look ahead and
assert that it is not followed by \n. But if you want to use this in
pcregrep, you'll need the -M option to make it search more than one line
at once.

> Or can I check for e.g. LFCRs, who are not actually just CRLFCR i.e. checking
> for LFCR which are not prefixed by another CR.

Yes, assertions should be able to do that.

>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|RESOLVED                    |REOPENED
>          Resolution|INVALID                     |

I see no point in reopening this, because it is not a bug.

> 1) May I suggest (which is why I reopened this "bug") that you add a small
> example at the places you've mentioned.
> Perhaps one similar to mine e.g.:
> This means that "^a[^$]" would not match the single line "a", but only e.g.
> "ab"
>
> 2) Further, it would perhaps make sense to tell what this particular condition
> is... I guess "the end of the current input line has been met" (in contrast to
> "I found an end-of-line character").

It very clearly says that [...] always matches a character. It is also
very clear, in the section "Characters and metacharacters", that $ is a
metacharacter only outside square brackets. I really don't think it is
worth saying any more.

> So basically people are looking for kinda binarygrep (just google around, I'm
> not the only one missing this).

Perhaps pcregrep with the -M option might do what you want.

Philip

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
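
As an illustration of the assertion-based approach discussed in this
message, here is a minimal sketch. It again uses Python's re module in
place of PCRE (an assumption; lookahead and lookbehind behave the same
way in both for these patterns). The names BARE_CR, LFCR_NOT_AFTER_CR,
bare_cr_offsets and line_based_search are made up for this example, and
line_based_search is only a rough model of a line-oriented grep, not
pcregrep's actual implementation.

```python
import re

# CR not followed by LF: negative lookahead.
BARE_CR = re.compile(r'\r(?!\n)')

# LF CR that is not preceded by another CR: negative lookbehind.
LFCR_NOT_AFTER_CR = re.compile(r'(?<!\r)\n\r')

def bare_cr_offsets(data):
    """Return the offsets of every CR that is not followed by LF."""
    return [m.start() for m in BARE_CR.finditer(data)]

def line_based_search(pattern, data):
    """Rough model of line-based matching: split the input on LF, which
    drops the terminating newline of each line, then search each line
    separately."""
    return any(re.search(pattern, line) for line in data.split('\n'))

if __name__ == '__main__':
    sample = 'A\r\nB\rC\r'
    print(bare_cr_offsets(sample))
    print(LFCR_NOT_AFTER_CR.search('x\n\ry') is not None)
    print(LFCR_NOT_AFTER_CR.search('x\r\n\ry') is not None)
    # The file from the hd dump: "A", CR, LF.
    print(line_based_search(r'\n', 'A\r\n'))      # line at a time
    print(re.search(r'\n', 'A\r\n') is not None)  # whole buffer at once
```

In the line-at-a-time model the newline has already been removed before
the pattern is tried, which is why a search for \n finds nothing there
while the same pattern succeeds against the whole buffer.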
