RegEx: finding a string that does not contain /<(w ...(-...)?|c ...)>/

Stefan Th. Gries Tue, 05 Sep 2006 05:20:46 -0700

Hi all

I have a regex question I can't solve. I know this is a really long posting, but
in order to explain the problem, I will first say what I can do and then what I
can't. Any ideas, pointers, snippets of code etc. would be really appreciated
...
Thx,
STG




--------------------
I.    This I can do ...
--------------------

I have an array @a with character strings:

@a=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.")

The defining characteristic of the character strings in the array is that
every word and every punctuation mark is preceded by a tag with the following
structure: /<(w ...(-...)?|c ...)>/

(a) I want to retrieve the sequence of

- a word tagged as <w CJC>, immediately followed by
- a word tagged as <w DT0>.

Since every tag starts with /</, I use this regex: /<w CJC>[^<]*?<w DT0>[^<]*/, 
which works just fine by retrieving only @a[0].
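
As a quick check of (a), here is a minimal sketch that runs the pattern over the three strings. It uses Python's `re` module purely as a stand-in for a Perl-compatible engine (an assumption of this example; the pattern uses only syntax both share), and `adjacent` and `hits_a` are names made up for the illustration.

```python
import re

# The three strings of @a from the post, written as a Python list.
a = [
    "<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
]

# [^<]*? can only cover the word after <w CJC>; it cannot cross another tag,
# so <w DT0> has to be the very next tag.
adjacent = re.compile(r"<w CJC>[^<]*?<w DT0>[^<]*")

# Indices of the strings in which the pattern is found.
hits_a = [i for i, s in enumerate(a) if adjacent.search(s)]
print(hits_a)
```

In R the same pattern string would presumably be passed to grep() or grepl() with perl=TRUE; that is not tested here.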

(b) I want to retrieve the sequence of

- a word tagged as <w CJC>, followed by
- between 0 and 2 words and their tags (again, looking like this:
/<(w ...(-...)?|c ...)>/), followed by
- a word tagged as <w DT0>.

I use this regex: /<w CJC>[^<]*?(<[wc] (...|...-...)>[^<]*?){0,2}<w DT0>[^<]*/, 
which works just fine by retrieving only @a[0:1]. (I know I could use "?:" to 
avoid the capturing for the backreference but I don't care about that at the 
moment.)
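
To illustrate (b) and the remark about "?:", the following sketch compares the capturing pattern with a non-capturing variant on the same three strings. Python's `re` module again stands in for a Perl-compatible engine (an assumption), and the variable names are invented for this example.

```python
import re

# The three strings of @a from the post, written as a Python list.
a = [
    "<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
]

# The pattern as given in the post: up to two tagged words between CJC and DT0.
capturing = re.compile(
    r"<w CJC>[^<]*?(<[wc] (...|...-...)>[^<]*?){0,2}<w DT0>[^<]*"
)

# Same pattern with (?:...) groups, so nothing is stored for backreferences.
non_capturing = re.compile(
    r"<w CJC>[^<]*?(?:<[wc] (?:...|...-...)>[^<]*?){0,2}<w DT0>[^<]*"
)

hits_capturing = [i for i, s in enumerate(a) if capturing.search(s)]
hits_non_capturing = [i for i, s in enumerate(a) if non_capturing.search(s)]
print(hits_capturing, hits_non_capturing)
```

Both variants select the same strings; the grouping syntax changes only what is remembered, not what is matched.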



----------------------
II.    This I can't ...
----------------------

I have an array @b with character strings:

@b=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
  "<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
  "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.")

I basically want to do the same things as above, but the complication is that
there are now additional kinds of tags -- tags that are not
/<(w ...(-...)?|c ...)>/ -- and my problem is how to skip them, i.e. how to
disregard them for the match. Thus,

(a) I want to retrieve those elements of @b in which "<w CJC>" and "<w DT0>" are

- directly adjacent, or
- not interrupted by any word with its tag (again, looking like this:
/<(w ...(-...)?|c ...)>/).

That is, I need to say something like "return everything from /<w CJC>/ to
/<w DT0>/, but if there is any /<(w ...(-...)?|c ...)>/ in between the two,
return nothing". Thus, of the array @b I would like to get back the first eight
elements, but not the last four elements:

@b[0]: yes, because only separated by a space
@b[1]: yes, because only separated by a space
@b[2]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by 
/<ptr[^>]+>/
@b[3]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by 
/<ptr[^>]+>/
@b[4]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by 
/<ptr[^>]+>/
@b[5]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by
/<p tr[^>]+>/ and /<ptr[^>]+>/
@b[6]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by 
/<w[^>]+>/
@b[7]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by 
/<c[^>]+>/
@b[8]: no, because interrupted by, among other things, /<c PUN>/
@b[9]: no, because interrupted by, among other things, /<w NN2-VVZ>/
@b[10]: no, because interrupted by, among other things, /<w AJ0>hungry/
@b[11]: no, because interrupted by, among other things, /<w AJ0>/ and /<c PUN>/

I do not use Perl, but R, so the regex

- *must* involve Perl-compatible regular expressions;
- would ideally work without lookaround (but if lookaround is absolutely 
needed, so be it).

The best I came up with was this (again, I don't care about putting in "?:"):
/<w CJC>[^<]+(<[^wc].*?>.*?)*<w DT0>[^<]*?/
but this does of course not work for @b[6:7], because the relevant part of the
regex excludes every tag that starts with /<[wc]/, whereas I need to rule out
only tags of the form /<(w ...(-...)?|c ...)>/.
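
To make the shortcoming concrete, this sketch runs that attempt over all twelve strings of @b. Python's `re` module is used as a stand-in for a Perl-compatible engine (an assumption), and `attempt`, `hits_attempt` and `wanted` are names invented here. On these strings the attempt misses @b[6] and @b[7] as described, and it also seems to accept @b[8] and @b[9], apparently because `.*?` may run across tag boundaries.

```python
import re

# The twelve strings of @b from the post, written as a Python list.
b = [
    "<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
]

# The attempt from the post, unchanged.
attempt = re.compile(r"<w CJC>[^<]+(<[^wc].*?>.*?)*<w DT0>[^<]*?")

wanted = list(range(8))  # @b[0] to @b[7], per the yes/no list in the post
hits_attempt = [i for i, s in enumerate(b) if attempt.search(s)]

print("wanted:", wanted)
print("found: ", hits_attempt)
print("missed:", sorted(set(wanted) - set(hits_attempt)))
print("extra: ", sorted(set(hits_attempt) - set(wanted)))
```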

(b) I want to retrieve the sequence of

- a word tagged as <w CJC>, followed by
- between 0 and 2 words and their tags (again, looking like this:
/<(w ...(-...)?|c ...)>/), followed by
- a word tagged as <w DT0>.

Again, the regex

- *must* involve Perl-compatible regular expressions;
- would ideally work without lookaround (but if lookaround is absolutely 
needed, so be it).
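
For what it is worth, one direction that might cover both II(a) and II(b) is a negative lookahead that lets a tag be skipped only when it is not of the form /<(w ...(-...)?|c ...)>/. The following is only a sketch under that assumption, not a tested solution for R: Python's `re` module stands in for a Perl-compatible engine, and `WORD_TAG`, `SKIP` and `build` are hypothetical names introduced for the illustration. It does rely on lookaround.

```python
import re

# The twelve strings of @b from the post, written as a Python list.
b = [
    "<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
]
# A string with three tagged words between CJC and DT0 (taken from @a).
three_words = "<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>."

# Inside of a word or punctuation tag, including the closing ">".
WORD_TAG = r"(?:w (?:...|...-...)|c ...)>"
# Any run of tags that are NOT word/punctuation tags, each with trailing text.
SKIP = r"(?:<(?!" + WORD_TAG + r")[^>]*>[^<]*)*"

def build(max_words):
    """Allow 0..max_words tagged words between <w CJC> and <w DT0>."""
    word = r"(?:<" + WORD_TAG + r"[^<]*" + SKIP + r")"
    return re.compile(
        r"<w CJC>[^<]*" + SKIP + word + "{0,%d}" % max_words + r"<w DT0>[^<]*"
    )

hits_0 = [i for i, s in enumerate(b) if build(0).search(s)]  # case II(a)
hits_2 = [i for i, s in enumerate(b) if build(2).search(s)]  # case II(b)
print(hits_0)
print(hits_2)
print(build(2).search(three_words))
```

With `max_words` set to 0 the pattern corresponds to (a); with 2 it corresponds to (b), where every string of @b has at most two intervening words.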

