Dear all
This is a follow-up to my earlier posting today regarding a regular expression
question. In the meantime, this is the best approximation I could come up with,
and it should give you a better idea of what I am talking about.
a<-c("a blockage and that.",
"a blockage and that.",
"a blockage and, that.",
"a blockage and hungry that.")
matches<-gregexpr("[^<]+(?:<[^wc].*?>.*?)*that", a, perl=TRUE)
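# start position of each match (-1 where an element has no match)
starts<-unlist(matches)
# match lengths, read from the "match.length" attribute only
lengths<-unlist(lapply(matches, attr, "match.length"))
stops<-starts+lengths-1
substr(a, starts, stops)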
What is still missing is that the disallowed string is not just "<[wc]" but
"<[wc] ", and I don't know how to express that. Any ideas (preferably with
lookarounds)?
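For what it is worth, here is a minimal sketch of how a negative lookahead
could express "no intervening tag that starts with '<w ' or '<c '". The sample
strings, the tag names (AT0, NN1, CJC, DT0, AJ0, PUN), the <pause> element and
the object names x, pattern, m and hits are all made up for illustration; they
are assumptions, not the data from this posting.

```r
# Hypothetical tagged strings (tag names and <pause> are invented).
x <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <pause> <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

# After the conjunction tag and its word, allow any number of tags,
# but only tags that do NOT begin with "<w " or "<c " (negative lookahead),
# each optionally followed by non-tag text, before the target tag.
pattern <- "<w CJC>[^<]+(?:(?!<[wc] )<[^>]*>[^<]*)*<w DT0>that"

# regexpr() returns the first match per element (-1 if there is none).
m <- regexpr(pattern, x, perl = TRUE)

# Indices of the elements that matched.
which(m > 0)

# The matched substrings, one per matching element.
hits <- regmatches(x, m)
hits
```

With this made-up input, the first two elements match and the last two do not,
because in those a "<c " or "<w " tag intervenes. If several matches per
element are possible, gregexpr() could be used in place of regexpr().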
Thanks a bunch,
STG
--
Stefan Th. Gries
---
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
---
ORIGINAL MESSAGE
> Dear all
>
> I again have a regular expression question. I have this character vector a:
>
> a<-c("a blockage and that.",
> "a blockage and that PUN>.",
> "a blockage and, that.",
> "a blockage and hungry that.")
>
> I would like to retrieve those elements of a in which "" and ""
> are
>
> - directly adjacent, as in a[1] or
> - not interrupted by "<[wc] ", as in a[2]
>
> And, of these elements I would like to consume all characters from the "<" in
> "" that is not a "<". For
> example, if I was only searching a[1], I would like something like this:
>
> matches<-gregexpr("[^<]+?[^<]+", a[1], perl=TRUE)
> substr(a[1], unlist(matches), unlist(matches)+attr(matches[[1]],
> "match.length")-1)
>
> I have been fiddling around with negative lookahead but I really can't get my
> head around this. Any pointers would be greatly appreciated. Thanks a lot,
> STG
__
R-help@stat.math.ethz.ch mailing list