Stefan Th. Gries on Tuesday, 5 September 2006, 14:20:
> Hi all

Hello Stefan

> I have a regex question I can't solve. I know this is a really long posting
> but in order to explain the problem, I first say what I can do and then
> what I can't. Any ideas, pointers, snippets of code etc. would be really
> appreciated ... Thx,
> STG

As you can see from the mail date, I didn't spend days to answer :-)

What I will present is a script that
- generates regexes (to be used in R),
- tests them, and
- demonstrates how to build complex regexes from parts.

The regexes might not be exactly correct, and the names could be better
chosen; I didn't pay much attention to capturing parentheses, the x modifier,
comments, etc.

I couldn't find a way without lookahead.

But the regexes select the cases you wish.
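
To show the lookahead idea in isolation, here is a minimal sketch. It is not
part of the script further down, and the names $word_tag and $other_tag are
made up for this illustration. A negative lookahead lets a pattern accept any
tag except the word/punctuation tags described in the question:

```perl
use strict;
use warnings;

# Hypothetical names, for illustration only.
# A word or punctuation tag as described in the question:
# <w XXX>, <w XXX-YYY> or <c XXX>
my $word_tag  = qr/<(?:w ...(?:-...)?|c ...)>/;

# Any tag that is NOT a word/punctuation tag: the negative lookahead
# (?!...) rejects the position if a word tag starts there.
my $other_tag = qr/(?!$word_tag)<[^>]+>/;

my @tags = ('<w CJC>', '<w NN2-VVZ>', '<c PUN>',
            '<ptr target=KB2LC003>', '<wtr target=KB2LC003>',
            '<p tr target=KB2LC003>');

# Keep only the tags that count as "other", i.e. the ones to be skipped.
my @skipped = grep { /^$other_tag$/ } @tags;

print "skipped: $_\n" for @skipped;
```

Only the <ptr ...>, <wtr ...> and <p tr ...> tags are printed; the three
word/punctuation tags are rejected by the lookahead.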

> --------------------
> I.    This I can do ...
> --------------------
>
> I have an array @a with character strings:
>
> @a=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.")
>
> The defining characteristic of the character strings in the array are that
> every word and every punctuation mark is preceded by a tag with the
> following structure: /<(w ...(-...)?|c ...)>/
>
> (a) I want to retrieve the sequence of
>
> - a word tagged as <w CJC>, immediately followed by
> - a word tagged as <w DT0>.
>
> Since every tag starts with /</, I use this regex:
> /<w CJC>[^<]*?<w DT0>[^<]*/, which works just fine by retrieving only @a[0].
>
> (b) I want to retrieve the sequence of
>
> - a word tagged as <w CJC>, followed by
> - between 0 and 2 words and their tags (again, looking like this:
>   /<(w ...(-...)?|c ...)>/), followed by
> - a word tagged as <w DT0>.
>
> I use this regex:
> /<w CJC>[^<]*?(<[wc] (...|...-...)>[^<]*?){0,2}<w DT0>[^<]*/,
> which works just fine by retrieving only @a[0:1]. (I know I could use "?:"
> to avoid the capturing for the backreference but I don't care about that
> at the moment.)
>
>
>
> ----------------------
> II.    This I can't ...
> ----------------------
>
> I have an array @b with character strings:
>
> @b=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
>   "<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
>   "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.")
>
> I basically want to do the same things as above, but the complication is
> that there are now additional kinds of tags -- tags that are not
> /<(w ...(-...)?|c ...)>/ -- and my problem is how to skip them, to
> disregard them for the match. Thus,
>
> (a) I want to retrieve those elements of @b in which "<w CJC>" and
> "<w DT0>" are
>
> - directly adjacent, or
> - not interrupted by any word with its tag (again, looking like this:
>   /<(w ...(-...)?|c ...)>/).
>
> That is, I need to say something like "return everything from /<w CJC>/ and
> /<w DT0>/ but not if there is any /<(w ...(-...)?|c ...)>/ in between the
> two, then return nothing". Thus, of the array @b I would like to get back
> the first eight elements, but not the last four elements:
>
> @b[0]: yes, because only separated by a space
> @b[1]: yes, because only separated by a space
> @b[2]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by
>   /<ptr[^>]+>/
> @b[3]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by
>   /<ptr[^>]+>/
> @b[4]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by
>   /<ptr[^>]+>/
> @b[5]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by
>   /<p tr[^>]+>/ and /<ptr[^>]+>/
> @b[6]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by
>   /<w[^>]+>/
> @b[7]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by
>   /<c[^>]+>/
> @b[8]: no, because interrupted by, among other things, /<c PUN>/
> @b[9]: no, because interrupted by, among other things, /<w NN2-VVZ>/
> @b[10]: no, because interrupted by, among other things, /<w AJ0>hungry/
> @b[11]: no, because interrupted by, among other things, /<w AJ0>/ and
>   /<c PUN>/
>
> I do not use Perl, but R, so the regex
>
> - *must* involve Perl-compatible regular expressions;
> - would ideally work without lookaround (but if lookaround is absolutely
> needed, so be it).
>
> The best I came up with was this (again, I don't care putting in "?:"):
> /<w CJC>[^<]+(<[^wc].*?>.*?)*<w DT0>[^<]*?/
> but this does of course not work for @b[6:7] because the relevant part of
> the regex only says /<[wc]/, but I need to rule out all this
> /<(w ...(-...)?|c ...)>/.
>
> (b) I want to retrieve the sequence of
>
> - a word tagged as <w CJC>, followed by
> - between 0 and 2 words and their tags (again, looking like this:
>   /<(w ...(-...)?|c ...)>/), followed by
> - a word tagged as <w DT0>.
>
> Again, the regex
>
> - *must* involve Perl-compatible regular expressions;
> - would ideally work without lookaround (but if lookaround is absolutely
> needed, so be it).

#!/usr/bin/perl

use strict;
use warnings;

my $w_CJC   =qr/(?:<w CJC>)/;
my $w_DT0   =qr/(?:<w DT0>)/;

my $generic1=qr/(?:<(w ...(-...)?|c ...)>)/;

my $ptr     =qr/(?:<ptr[^>]+>)/;
my $p_tr    =qr/(?:<p tr[^>]+>)/;
my $re_w    =qr/(?:<w[^ ][^>]+>)/; # NOTE: [^ ] distinguishes this from $generic1
my $re_c    =qr/(?:<c[^ ][^>]+>)/; # ditto

my $text    =qr/(?:[^<>]*)/; # what follows the tags

my $disregard   =qr/$text|$ptr|$p_tr/;

my $not_generic1=qr/(?:$w_CJC|$w_DT0|$ptr|$p_tr|$re_w|$re_c)$text/;


# just to check if selection is ok
#
sub retrieve {
  my ($aref, $regex)=@_;
  for my $str (@$aref) {
    if ($str=~/$regex/) {warn "retrieved: $str\n";}
    else {warn "ignored: $str\n";}
  }
}


my @a=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.");


my @b=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
  "<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.");


my $re_1a=qr/$w_CJC$text$w_DT0$text/;

my $re_1b=qr/$w_CJC$text(?:$generic1$text){0,2}$w_DT0$text/;


my $re_not_interrupted_by_generic=qr/($not_generic1?(?!(?:$generic1$text)+)?)*?/;

my $re_2a=qr/$w_CJC$text$re_not_interrupted_by_generic$w_DT0$text/;


warn "\n*** 1a /$re_1a/\n\n";
retrieve(\@a, $re_1a);

warn "\n*** 1b /$re_1b/\n\n";
retrieve(\@a, $re_1b);

warn "\n*** 2a /$re_2a/\n\n";
retrieve(\@b, $re_2a);

__END__

The output is:

*** 1a /(?-xism:(?-xism:(?:<w CJC>))(?-xism:(?:[^<>]*))(?-xism:(?:<w DT0>))(?-xism:(?:[^<>]*)))/

retrieved: <w AT0>a <w CJC>and <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.

*** 1b /(?-xism:(?-xism:(?:<w CJC>))(?-xism:(?:[^<>]*))(?:(?-xism:(?:<(w ...(-...)?|c ...)>))(?-xism:(?:[^<>]*))){0,2}(?-xism:(?:<w DT0>))(?-xism:(?:[^<>]*)))/

retrieved: <w AT0>a <w CJC>and <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.

*** 2a /(?-xism:(?-xism:(?:<w CJC>))(?-xism:(?:[^<>]*))(?-xism:((?-xism:(?:(?-xism:(?:<w CJC>))|(?-xism:(?:<w DT0>))|(?-xism:(?:<ptr[^>]+>))|(?-xism:(?:<p tr[^>]+>))|(?-xism:(?:<w[^ ][^>]+>))|(?-xism:(?:<c[^ ][^>]+>)))(?-xism:(?:[^<>]*)))?(?!(?:(?-xism:(?:<(w ...(-...)?|c ...)>))(?-xism:(?:[^<>]*)))+)?)*?)(?-xism:(?:<w DT0>))(?-xism:(?:[^<>]*)))/

retrieved: <w AT0>a <w CJC>and <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <w DT0>that <w NN2>cars
retrieved: <w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.
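
The script above covers case II (a) only. As a possible starting point for
case II (b), here is a sketch built from the same kind of parts. The names
($word_tag, $skip, $re_2a_alt, $re_2b) and the extra test string with three
intervening words are assumptions made for this example, not part of the
original script. In this variant the negative lookahead is mandatory for
every skipped tag, so any tag that is not a word/punctuation tag is skipped
without having to be listed by name:

```perl
use strict;
use warnings;

# All names below are hypothetical; only the tag shapes come from the question.
my $text     = qr/[^<>]*/;                      # text following a tag
my $word_tag = qr/<(?:w ...(?:-...)?|c ...)>/;  # word or punctuation tag
# Zero or more tags that are NOT word/punctuation tags, each with its text.
my $skip     = qr/(?:(?!$word_tag)<[^>]+>$text)*/;

# II (a): nothing but skippable tags between <w CJC> and <w DT0>.
my $re_2a_alt = qr/<w CJC>$text$skip<w DT0>$text/;

# II (b): additionally allow 0 to 2 word/punctuation tags in between.
my $re_2b     = qr/<w CJC>$text$skip(?:$word_tag$text$skip){0,2}<w DT0>$text/;

# What stands between "<w CJC>and " and "<w DT0>that" in each test string.
my @middles = (
  '',
  '<ptr target=KB2LC003>',
  '<ptr target=KB2LC003> ',
  '<ptr target=KB2LC003> <ptr target=KB2LC004> ',
  '<p tr target=KB2LC003> <ptr target=KB2LC004> ',
  '<wtr target=KB2LC003>',
  '<ctr target=KB2LC003>',
  '<ptr target=KB2LC003><c PUN>, ',
  '<ptr target=KB2LC003><w NN2-VVZ>cars ',
  '<w AJ0>hungry ',
  '<w AJ0>hungry <c PUN>,',
  '<w AJ0>hungry <w AJ0>hungry <c PUN>,',   # three words: too many for (b)
);
my @strings = map { '<w AT0>a <w CJC>and ' . $_ . '<w DT0>that<c PUN>.' } @middles;

my ($n_2a, $n_2b) = (0, 0);
for my $str (@strings) {
  my $hit_2a = $str =~ $re_2a_alt ? 1 : 0;
  my $hit_2b = $str =~ $re_2b     ? 1 : 0;
  $n_2a += $hit_2a;
  $n_2b += $hit_2b;
  print "2a=$hit_2a 2b=$hit_2b  $str\n";
}
print "2a matched $n_2a, 2b matched $n_2b of ", scalar(@strings), "\n";
```

With these strings, the (a) pattern matches the first seven and the (b)
pattern matches all but the last. Both patterns rely on negative lookahead,
which PCRE supports, but they have only been tried in Perl here, not in R.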


Hope this helps a bit :-)

Dani
