Stefan Th. Gries wrote on Tuesday, 5 September 2006 14:20:
> Hi all

Hello Stefan
> I have a regex question I can't solve. I know this is a really long posting
> but in order to explain the problem, I first say what I can do and then
> what I can't. Any ideas, pointers, snippets of code etc. would be really
> appreciated ... Thx,
> STG

As you can see from the mail date, I didn't spend days to answer :-)

What I will present is a script to
- generate regexes (to be used in R)
- test them
- demonstrate the building of complex regexes from parts

The regexes might not be exactly correct, and the names could be better chosen. I didn't care much about capturing parentheses, the x modifier, comments etc. I couldn't find a way without lookahead, but the regexes select the cases you wish.

> --------------------
> I. This I can do ...
> --------------------
>
> I have an array @a with character strings:
>
> @a=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.")
>
> The defining characteristic of the character strings in the array is that
> every word and every punctuation mark is preceded by a tag with the
> following structure: /<(w ...(-...)?|c ...)>/
>
> (a) I want to retrieve the sequence of
>
> - a word tagged as <w CJC>, immediately followed by
> - a word tagged as <w DT0>.
>
> Since every tag starts with /</, I use this regex:
> /<w CJC>[^<]*?<w DT0>[^<]*/, which works just fine by retrieving only @a[0].
>
> (b) I want to retrieve the sequence of
>
> - a word tagged as <w CJC>, followed by
> - between 0 and 2 words and their tags (again, looking like this:
>   /<(w ...(-...)?|c ...)>/), followed by
> - a word tagged as <w DT0>.
>
> I use this regex: /<w CJC>[^<]*?(<[wc] (...|...-...)>[^<]*?){0,2}<w DT0>[^<]*/,
> which works just fine by retrieving only @a[0:1]. (I know I
> could use "?:" to avoid the capturing for the backreference but I don't
> care about that at the moment.)
>
> ----------------------
> II. This I can't ...
> ----------------------
>
> I have an array @b with character strings:
>
> @b=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
> "<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
> "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.")
>
> I basically want to do the same things as above, but the complication is
> that there are now additional kinds of tags -- tags that are not
> /<(w ...(-...)?|c ...)>/ -- and my problem is how to skip them, to disregard
> them for the match. Thus,
>
> (a) I want to retrieve those elements of @b in which "<w CJC>" and
> "<w DT0>" are
>
> - directly adjacent, or
> - not interrupted by any word with its tag (again, looking like this:
>   /<(w ...(-...)?|c ...)>/).
>
> That is, I need to say something like "return everything from /<w CJC>/ to
> /<w DT0>/ but not if there is any /<(w ...(-...)?|c ...)>/ in between the
> two, then return nothing". Thus, of the array @b I would like to get back
> the first eight elements, but not the last four elements:
>
> @b[0]: yes, because only separated by a space
> @b[1]: yes, because only separated by a space
> @b[2]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<ptr[^>]+>/
> @b[3]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<ptr[^>]+>/
> @b[4]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<ptr[^>]+>/
> @b[5]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<p tr[^>]+>/ and /<ptr[^>]+>/
> @b[6]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<w[^>]+>/
> @b[7]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<c[^>]+>/
> @b[8]: no, because interrupted by, among other things, /<c PUN>/
> @b[9]: no, because interrupted by, among other things, /<w NN2-VVZ>/
> @b[10]: no, because interrupted by, among other things, /<w AJ0>hungry/
> @b[11]: no, because interrupted by, among other things, /<w AJ0>/ and /<c PUN>/
>
> I do not use Perl, but R, so the regex
>
> - *must* involve Perl-compatible regular expressions;
> - would ideally work without lookaround (but if lookaround is absolutely
>   needed, so be it).
>
> The best I came up with was this (again, I don't care about putting in "?:"):
> /<w CJC>[^<]+(<[^wc].*?>.*?)*<w DT0>[^<]*?/ but this does of course not work
> for @b[6:7] because the relevant part of the regex only says /<[wc]/, but I
> need to rule out all this /<(w ...(-...)?|c ...)>/.
>
> (b) I want to retrieve the sequence of
>
> - a word tagged as <w CJC>, followed by
> - between 0 and 2 words and their tags (again, looking like this:
>   /<(w ...(-...)?|c ...)>/), followed by
> - a word tagged as <w DT0>.
>
> Again, the regex
>
> - *must* involve Perl-compatible regular expressions;
> - would ideally work without lookaround (but if lookaround is absolutely
>   needed, so be it).
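The script below only covers 1a, 1b and 2a, so for II(b) here is a minimal sketch of how the same building-block idea might be extended. It is an assumption-laden variation, not a tested solution for real corpus data: the names $word_tag, $other_tag and $skip are made up for this example, and it relies on a negative lookahead to define "any tag that is not a word/punctuation tag". It also includes a more compact take on II(a) built from the same parts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical building blocks (names are not from the script below).
my $word_tag  = qr/<(?:w ...(?:-...)?|c ...)>/;  # a word or punctuation tag
my $other_tag = qr/(?!$word_tag)<[^>]+>/;        # any tag that is NOT a word tag
my $text      = qr/[^<>]*/;                      # text following a tag
my $skip      = qr/(?:$other_tag$text)*/;        # zero or more tags to disregard

# II(a): only disregarded tags may stand between <w CJC> and <w DT0>.
my $re_2a = qr/<w CJC>$text$skip<w DT0>$text/;

# II(b): additionally allow between 0 and 2 word tags in between.
my $re_2b = qr/<w CJC>$text$skip(?:$word_tag$text$skip){0,2}<w DT0>$text/;

my @samples = (
    '<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.',           # 2a yes, 2b yes
    '<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.',  # 2a no,  2b yes
    '<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.',          # 2a no,  2b yes
    '<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.',  # 2a no, 2b no
);

for my $str (@samples) {
    printf "2a:%-3s 2b:%-3s %s\n",
        ($str =~ $re_2a ? 'yes' : 'no'),
        ($str =~ $re_2b ? 'yes' : 'no'),
        $str;
}
```

Because <w DT0> is itself a word tag, the {0,2} group may first swallow it and then has to backtrack; that is harmless here but worth knowing when reading the match.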
#!/usr/bin/perl
use strict;
use warnings;

my $w_CJC   =qr/(?:<w CJC>)/;
my $w_DT0   =qr/(?:<w DT0>)/;
my $generic1=qr/(?:<(w ...(-...)?|c ...)>)/;
my $ptr     =qr/(?:<ptr[^>]+>)/;
my $p_tr    =qr/(?:<p tr[^>]+>)/;
my $re_w    =qr/(?:<w[^ ][^>]+>)/; # NOTE: [^ ] to distinguish from $generic1
my $re_c    =qr/(?:<c[^ ][^>]+>)/; # ditto
my $text    =qr/(?:[^<>]*)/;       # what follows the tags

my $disregard   =qr/$text|$ptr|$p_tr/;
my $not_generic1=qr/(?:$w_CJC|$w_DT0|$ptr|$p_tr|$re_w|$re_c)$text/;

# just to check if selection is ok
#
sub retrieve {
  my ($aref, $regex)=@_;
  for my $str (@$aref) {
    if ($str=~/$regex/) {warn "retrieved: $str\n";}
    else {warn "ignored: $str\n";}
  }
}

my @a=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.");

my @b=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
       "<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
       "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.");

my $re_1a=qr/$w_CJC$text$w_DT0$text/;
my $re_1b=qr/$w_CJC$text(?:$generic1$text){0,2}$w_DT0$text/;

my $re_not_interrupted_by_generic=qr/($not_generic1?(?!(?:$generic1$text)+)?)*?/;
my $re_2a=qr/$w_CJC$text$re_not_interrupted_by_generic$w_DT0$text/;

# 1a and 1b are tested against @a, 2a against @b
warn "\n*** 1a /$re_1a/\n\n";
retrieve(\@a, $re_1a);

warn "\n*** 1b /$re_1b/\n\n";
retrieve(\@a, $re_1b);

warn "\n*** 2a /$re_2a/\n\n";
retrieve(\@b, $re_2a);

__END__

The output is:

*** 1a /(?-xism:(?-xism:(?:<w CJC>))(?-xism:(?:[^<>]*))(?-xism:(?:<w DT0>))(?-xism:(?:[^<>]*)))/

retrieved: <w AT0>a <w CJC>and <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.

*** 1b /(?-xism:(?-xism:(?:<w CJC>))(?-xism:(?:[^<>]*))(?:(?-xism:(?:<(w ...(-...)?|c ...)>))(?-xism:(?:[^<>]*))){0,2}(?-xism:(?:<w DT0>))(?-xism:(?:[^<>]*)))/

retrieved: <w AT0>a <w CJC>and <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.

*** 2a /(?-xism:(?-xism:(?:<w CJC>))(?-xism:(?:[^<>]*))(?-xism:((?-xism:(?:(?-xism:(?:<w CJC>))|(?-xism:(?:<w DT0>))|(?-xism:(?:<ptr[^>]+>))|(?-xism:(?:<p tr[^>]+>))|(?-xism:(?:<w[^ ][^>]+>))|(?-xism:(?:<c[^ ][^>]+>)))(?-xism:(?:[^<>]*)))?(?!(?:(?-xism:(?:<(w ...(-...)?|c ...)>))(?-xism:(?:[^<>]*)))+)?)*?)(?-xism:(?:<w DT0>))(?-xism:(?:[^<>]*)))/

retrieved: <w AT0>a <w CJC>and <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <w DT0>that <w NN2>cars
retrieved: <w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.

Hope this helps a bit :-)

Dani
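P.S. On the "to be used in R" part: as the output above shows, interpolating qr// objects wraps every part in (?-xism:...), and newer Perl versions print (?^:...) instead, which I would not count on every PCRE build accepting. One possible way around that, sketched here under my own assumptions, is to assemble the pattern from plain strings and print the result. The variable names are made up for the example, and the word-tag part uses non-capturing groups where $generic1 in the script captures.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Plain strings instead of qr// objects, so no flag wrappers get added
# when the parts are interpolated into a larger pattern.
my $w_cjc   = '<w CJC>';
my $w_dt0   = '<w DT0>';
my $generic = '<(?:w ...(?:-...)?|c ...)>';   # word or punctuation tag
my $text    = '[^<>]*';                       # text following a tag

# Same structure as $re_1b in the script: 0 to 2 tagged words in between.
my $pattern_1b = "$w_cjc$text(?:$generic$text){0,2}$w_dt0$text";

# This is the string one would paste into R (with perl=TRUE).
print "$pattern_1b\n";

# Quick check inside Perl before handing the pattern over.
my @a = (
    '<w AT0>a <w CJC>and <w DT0>that<c PUN>.',
    '<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.',
    '<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.',
);
for my $str (@a) {
    print(($str =~ /$pattern_1b/ ? 'retrieved' : 'ignored'), ": $str\n");
}
```

The trade-off is that plain strings lose the compile-time checking that qr// gives each part, so a typo in one building block only shows up when the full pattern is used.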