Hi all,

I have a regex question I can't solve. I know this is a really long post, but in order to explain the problem, I first say what I can do and then what I can't. Any ideas, pointers, snippets of code etc. would be really appreciated ...

Thx, STG

--------------------
I. This I can do ...
--------------------

I have an array @a with character strings:

@a=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.")

The defining characteristic of the character strings in the array is that every word and every punctuation mark is preceded by a tag with the following structure:
/<(w ...(-...)?|c ...)>/

(a) I want to retrieve the sequence of
- a word tagged as <w CJC>, immediately followed by
- a word tagged as <w DT0>.
Since every tag starts with /</, I use this regex:
/<w CJC>[^<]*?<w DT0>[^<]*/
which works just fine by retrieving only @a[0].

(b) I want to retrieve the sequence of
- a word tagged as <w CJC>, followed by
- between 0 and 2 words and their tags (again, looking like this: /<(w ...(-...)?|c ...)>/), followed by
- a word tagged as <w DT0>.
I use this regex:
/<w CJC>[^<]*?(<[wc] (...|...-...)>[^<]*?){0,2}<w DT0>[^<]*/
which works just fine by retrieving only @a[0:1]. (I know I could use "?:" to avoid the capturing for the backreference, but I don't care about that at the moment.)

----------------------
II. This I can't ...
----------------------

I have an array @b with character strings:

@b=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.")

I basically want to do the same things as above, but the complication is that there are now additional kinds of tags (tags that are not /<(w ...(-...)?|c ...)>/), and my problem is how to skip them, i.e. to disregard them for the match. Thus,

(a) I want to retrieve those elements of @b in which "<w CJC>" and "<w DT0>" are
- directly adjacent, or
- not interrupted by any word with its tag (again, looking like this: /<(w ...(-...)?|c ...)>/).
That is, I need to say something like "return everything from /<w CJC>/ to /<w DT0>/, but if there is any /<(w ...(-...)?|c ...)>/ in between the two, return nothing".
Thus, of the array @b I would like to get back the first eight elements, but not the last four elements:

@b[0]: yes, because only separated by a space
@b[1]: yes, because only separated by a space
@b[2]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<ptr[^>]+>/
@b[3]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<ptr[^>]+>/
@b[4]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<ptr[^>]+>/
@b[5]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<p tr[^>]+>/ and /<ptr[^>]+>/
@b[6]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<w[^>]+>/
@b[7]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<c[^>]+>/
@b[8]: no, because interrupted by, among other things, /<c PUN>/
@b[9]: no, because interrupted by, among other things, /<w NN2-VVZ>/
@b[10]: no, because interrupted by, among other things, /<w AJ0>hungry/
@b[11]: no, because interrupted by, among other things, /<w AJ0>/ and /<c PUN>/

I do not use Perl, but R, so the regex
- *must* involve Perl-compatible regular expressions;
- would ideally work without lookaround (but if lookaround is absolutely needed, so be it).

The best I came up with was this (again, I don't care about putting in "?:"):
/<w CJC>[^<]+(<[^wc].*?>.*?)*<w DT0>[^<]*?/
but this does of course not work for @b[6:7], because the relevant part of the regex only says /<[^wc]/ and so rejects every tag that starts with "w" or "c", whereas I need to rule out only tags of the form /<(w ...(-...)?|c ...)>/.

(b) I want to retrieve the sequence of
- a word tagged as <w CJC>, followed by
- between 0 and 2 words and their tags (again, looking like this: /<(w ...(-...)?|c ...)>/), followed by
- a word tagged as <w DT0>.
Again, the regex
- *must* involve Perl-compatible regular expressions;
- would ideally work without lookaround (but if lookaround is absolutely needed, so be it).
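
The requirements in II(a) and II(b) can be sketched as follows. This is only one possible approach, not a definitive answer: it assumes that a negative lookahead is acceptable (the post would prefer to avoid lookaround), and it uses Python's re module purely as a test harness. The names WORD_TAG, SKIP, II_A, II_B and matching_indices are made up for illustration. The patterns use only constructs that are also valid in Perl-compatible regular expressions, so in R the same pattern strings should be usable with perl=TRUE.

```python
import re

# Word/punctuation tag exactly as described in the post:
# <w XXX>, <w XXX-YYY> or <c XXX>.
WORD_TAG = r"<(?:w ...(?:-...)?|c ...)>"

# Zero or more "other" tags, each followed by any non-tag text.
# ASSUMPTION: a negative lookahead rejects word/punctuation tags here,
# so <ptr ...>, <wtr ...>, <ctr ...> are skipped but <w AJ0> is not.
SKIP = r"(?:<(?!(?:w ...(?:-...)?|c ...)>)[^>]*>[^<]*)*"

# II(a): <w CJC> word, only non-word tags in between, then <w DT0> word.
II_A = re.compile(r"<w CJC>[^<]*" + SKIP + r"<w DT0>[^<]*")

# II(b): as II(a), but allowing 0 to 2 tagged words in between.
II_B = re.compile(
    r"<w CJC>[^<]*" + SKIP
    + r"(?:" + WORD_TAG + r"[^<]*" + SKIP + r"){0,2}"
    + r"<w DT0>[^<]*"
)

# The arrays from the post (0-based indices, as in the post).
a = [
    "<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
]
b = [
    "<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
]


def matching_indices(pattern, strings):
    """Return the indices of all strings in which the pattern is found."""
    return [i for i, s in enumerate(strings) if pattern.search(s)]


if __name__ == "__main__":
    print("II(a) on @b:", matching_indices(II_A, b))
    print("II(b) on @b:", matching_indices(II_B, b))
    print("II(b) on @a:", matching_indices(II_B, a))
```

Run against the arrays from the post, II_A selects @b[0] through @b[7] and rejects @b[8] through @b[11], which is the yes/no list given under II(a). II_B accepts every element of @b (none has more than two intervening tagged words) and, on @a, only the first two elements. Whether a lookaround-free pattern can do the same is left open here.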