Re: [R] Regular expressions: offsets of groups

2010-09-30 Thread Titus von der Malsburg
Ok, we decided to have a shot at modifying gregexpr.  Let's see how it
works out.  If anybody is interested in discussing this please contact
me.  R-help doesn't seem like the right place for further discussion.
Is there a default place for discussing things like that?

Thanks everybody for your responses!

  Titus

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Regular expressions: offsets of groups

2010-09-29 Thread Titus von der Malsburg
Bill, Michael,

good to see I'm not the only one who sees potential for improvements
in the regexpr domain.  Adding a subpattern argument is certainly a
step in the right direction and would make my life much easier.
However, in my application I need to know not only the position of one
group but also the position of the overall match in the original
string.  The ideal solution would provide positions and match lengths
for the whole pattern and for all groups if desired.  Only this would
solve all related issues.  One possibility is to have a subpattern
argument that accepts a vector of numbers (0 refers to the whole
pattern):

  > gregexpr("a+(b+)", "abcdaabbc", subpattern=c(0,1))
  [[1]]:
  [[1]][[1]]:
  [1] 1 5
  attr(, "match.length"):
  [1] 2 4
  [[1]][[2]]:
  [1] 2 7
  attr(, "match.length"):
  [1] 1 2

A weakness of this solution is that the structure of the return values
changes if length(subpattern) > 1.  An alternative is to have a separate
function, say ggregexpr for group gregexpr, that returns a list of
lists as in the above example.  This function would always return
positions and match lengths of the whole pattern (group 0) and all
groups.  The original gregexpr could still have the subpattern
argument but it would only accept single numbers.  This way the return
format of gregexpr remains the same.
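
A rough sketch of the list-of-groups idea, under stated assumptions: in later R releases, gregexpr() with perl = TRUE attaches "capture.start" and "capture.length" attributes to each result, so a small wrapper can report group 0 and all groups from a single search. The name group_offsets and its return layout are hypothetical, chosen only for illustration; this is not an existing R function.

```r
# Hypothetical helper (not part of base R): start offsets and lengths of the
# whole match (column "0") and of every capture group (columns "1", "2", ...).
# Assumes an R version where gregexpr(perl = TRUE) returns capture attributes.
group_offsets <- function(pattern, text) {
  m <- gregexpr(pattern, text, perl = TRUE)[[1]]
  if (m[1] == -1) return(NULL)                    # no match at all
  starts  <- cbind(as.vector(m), attr(m, "capture.start"))
  lengths <- cbind(as.vector(attr(m, "match.length")),
                   attr(m, "capture.length"))
  groups  <- as.character(seq_len(ncol(starts)) - 1)
  colnames(starts)  <- groups
  colnames(lengths) <- groups
  list(start = starts, length = lengths)          # one row per match
}

res <- group_offsets("a+(b+)", "abcdaabbc")
res$start    # whole-match and group starts, one row per match
res$length   # corresponding match lengths
```

Returning matrices with one row per match keeps the shape the same regardless of how many groups the pattern has, which sidesteps the changing-return-structure problem mentioned above.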

Best,

  Titus





Re: [R] Regular expressions: offsets of groups

2010-09-29 Thread Michael Bedward
I'd definitely be a customer for it Titus. And it does seem like an
obvious hole in regex processing in R that cries out to be filled.

Um, ggregexpr isn't the sexiest of function names :)  Perhaps we can
think of something a little easier ?

How is your C coding? Bill? Anyone else?  I could have a go at
writing some prototype code to test in the next few days, though if
someone else with decent C skills is itching to do it please speak up.

Michael





Re: [R] Regular expressions: offsets of groups

2010-09-29 Thread Titus von der Malsburg
On Wed, Sep 29, 2010 at 1:58 PM, Michael Bedward
<michael.bedw...@gmail.com> wrote:
> How is your C coding? Bill? Anyone else?  I could have a go at
> writing some prototype code to test in the next few days, though if
> someone else with decent C skills is itching to do it please speak up.

We have a skilled C- and R-programmer who could work on it. I'll talk to him.

   Titus



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Michael Bedward
What Titus wants to do is akin to retrieving capturing groups from a
Matcher object in Java. I also thought there must be an existing,
elegant solution to this some time ago and searched for it, including
looking at the sources (albeit with not much expertise) but came up
blank.

I also looked at the stringr package (which is nice) but it doesn't
quite do it either.

Michael





Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Titus von der Malsburg
On Tue, Sep 28, 2010 at 9:46 AM, Michael Bedward
<michael.bedw...@gmail.com> wrote:
> What Titus wants to do is akin to retrieving capturing groups from a
> Matcher object in Java.

Precisely.  Here's the description:

  
http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#start(int)

Gabor's lookbehind trick solves some special cases but it's not the
kind of general solution I'm looking for.  Let me explain what I'm
trying to achieve here.  I'm working on a package that provides tools
for processing and analyzing eye movements (we're doing reading
research).  In most situations, eye movements consist of fixations
where the eyes are relatively stationary and saccades, quick movements
between fixations.  A common way to represent eye movements is as
strings of symbols, where each symbol corresponds to a fixation on a
particular region.  AABC means two fixations on A followed by a fixation
on B and then C.  When people analyze eye movements it's often necessary
to find specific events in the eye movement record like: fixations on
the word C preceded by fixations on words D-F and followed by
fixations on words A-C.  This event can be specified using this
regexpr: [D-F]+(C)[A-C]+  The group (in parentheses) indicates the
substring for which I'd like to know the position in the overall
string.  Another application is the extraction of subsequences from a
sequence of fixations.  Note that in some situations people might have
to use more groups in their regexprs and that groups can be nested.
In this case the user would have to indicate for which group he/she
wants to know the offset.  I'm not an expert on regexpr engines but
I'm pretty sure the necessary information is available in the engine.
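
To make the use case concrete, here is a minimal sketch that assumes an R version in which regexpr() with perl = TRUE returns "capture.start" and "capture.length" attributes. The fixation string is made up for illustration; it is not data from the package described above.

```r
# Illustrative fixation sequence (invented for this sketch): one letter per
# fixation, naming the region that was fixated.
fixations <- "ABDEFCABCD"

# Fixations on C, preceded by fixations on D-F and followed by fixations on A-C.
m <- regexpr("[D-F]+(C)[A-C]+", fixations, perl = TRUE)

event_start  <- as.vector(m)                            # start of the whole event
event_length <- as.vector(attr(m, "match.length"))
c_start      <- as.vector(attr(m, "capture.start"))     # offset of the group (C)
c_length     <- as.vector(attr(m, "capture.length"))

# Extract the whole event and the grouped fixation by position.
event  <- substring(fixations, event_start, event_start + event_length - 1)
target <- substring(fixations, c_start, c_start + c_length - 1)
```

Both offsets refer to positions in the original string, so the grouped fixation can be located within the full record without a second search.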

Gabor, I see you're the author of gsubfn (fantastic package!).  Do you
see a relatively simple way to expose information about group offsets
and their corresponding match lengths?  I think this could be useful
for other applications as well.  At least it seems Michael could use
it, too.  We can cook up something for ourselves but a general
solution would benefit the larger community.

   Titus



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Gabor Grothendieck
On Tue, Sep 28, 2010 at 6:52 AM, Titus von der Malsburg
<malsb...@gmail.com> wrote:
> On Tue, Sep 28, 2010 at 9:46 AM, Michael Bedward
> <michael.bedw...@gmail.com> wrote:
>> What Titus wants to do is akin to retrieving capturing groups from a
>> Matcher object in Java.
>
> Precisely.  Here's the description:
>
>  http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#start(int)
>
> Gabor's lookbehind trick solves some special cases but it's not the

The only limitation is that in the regular expressions supported by R
you cannot have repetition in the (?<=...) portion but none of your
examples -- neither the one you gave nor the one below -- require that
since if the prior expression ends in X+ you can just use X.  Are
you sure it does not cover all your actual situations?

If you truly do have situations that require repetition, a
gregexpr plus gsubfn will do it in one line.   Parenthesize the
portion of the regular expression you want to capture and replace
every character in it with X (or some other character that does not
otherwise occur).  Then find the positions and lengths of strings of
X.

> gregexpr("X+", gsubfn("a(b+)", ~ gsub(".", "X", x), "abcdaabbcbbb"))
[[1]]
[1] 1 5
attr(,"match.length")
[1] 1 2

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread William Dunlap

> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Michael Bedward
> Sent: Tuesday, September 28, 2010 12:46 AM
> To: Titus von der Malsburg
> Cc: r-help@r-project.org
> Subject: Re: [R] Regular expressions: offsets of groups
>
> What Titus wants to do is akin to retrieving capturing groups from a
> Matcher object in Java. I also thought there must be an existing,
> elegant solution to this some time ago and searched for it, including
> looking at the sources (albeit with not much expertise) but came up
> blank.
>
> I also looked at the stringr package (which is nice) but it doesn't
> quite do it either.

S+ has a subpattern=number argument to regexpr and
related functions.  It means that the text matched
by the subpattern'th parenthesized expression in the
pattern will be considered the matched text.  E.g.,
to find runs of b's that come immediately after a's:

  > gregexpr("a+(b+)", "abcdaabbc", subpattern=1)
  [[1]]:
  [1] 2 7
  attr(, "match.length"):
  [1] 1 2

or to find bc's that come after 2 or more ab's
  > gregexpr("(ab){2,}bc", "abbcabababbcabcababbc", subpattern=1)

regexpr() and strsplit() have this argument in S+ 8.1 but
gregexpr() is not yet in a released version of S+.

subpattern=0, the default, means to use the entire
pattern.  regexpr allows subpattern=-1, which means
to return a list with one element for each subpattern.
I don't know if the extra complexity is worth it.
(gregexpr does not allow subpattern=-1.)

The usual C regexec() returns this information.
Perhaps it would be handy to have it in R.
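
For comparison, a minimal sketch using the regexec() function found in later R releases (as far as I know it was not yet available in R when this thread was written). It reports the first match only, giving the start and length of the whole match followed by each parenthesized subexpression.

```r
x <- "abcdaabbc"

# regexec() reports the first match only: element 1 is the whole match,
# the following elements are the parenthesized subexpressions in order.
res <- regexec("a+(b+)", x)
m   <- res[[1]]

starts  <- as.vector(m)                        # whole match, then group 1
lengths <- as.vector(attr(m, "match.length"))

# regmatches() turns the same result into the matched strings.
pieces <- regmatches(x, res)[[1]]
```

Unlike the subpattern argument described above, this always returns every group, so the caller picks the element of interest by index.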

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

 
 



Re: [R] Regular expressions: offsets of groups

2010-09-28 Thread Michael Bedward
Ah, that's interesting - thanks Bill. That's certainly on the right
track for me (Titus, you too ?) especially if the subpattern argument
accepted a vector of multiple group indices.

As you say, this is straightforward in C. I'd be happy to (try to)
make a patch for the R sources if there was some consensus on the best
way to implement it, ie. as a new R function or by extending existing
function(s).

Michael





[R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
Dear list!

> gregexpr("a+(b+)", "abcdaabbc")
[[1]]
[1] 1 5
attr(,"match.length")
[1] 2 4

What I want is the offsets of the matches for the group (b+), i.e. 2
and 7, not the offsets of the complete matches.  Is there a way in R
to get that?

I know about gsubfn and strapply, but they only give me the strings
matched by groups not their offsets.

I could write something myself that first takes the above matches
("ab" and "aabb") and then searches again using only the group (b+).
For this to work, I'd have to parse the regular expression and search
several times (> 2, for nested groups) instead of just once.  But I'm
sure there is a better way to do this.

Thanks for any suggestion!

   Titus



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread jim holtman
try this:

> x <- gregexpr("a+(b+)", "abcdaabbcaaacaaab")
> justA <- gregexpr("a+", "abcdaabbcaaacaaab")
> # find matches in 'x' for 'justA'
> indx <- which(justA[[1]] %in% x[[1]])
> # now determine where 'b' starts
> justA[[1]][indx] + attr(justA[[1]], 'match.length')[indx]
[1]  2  7 17







-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
Thank you Jim, but just as the solution that I discussed, your
proposal involves deconstructing the pattern and searching several
times.  I'm looking for a general and efficient solution.  Internally,
the regexpr engine has all necessary information after one pass
through the string.  What I need is an interface that exposes this
information.

  Titus





Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
On Mon, Sep 27, 2010 at 7:16 PM, Henrique Dallazuanna <www...@gmail.com> wrote:
> You've tried:
>
> gregexpr("b+", "abcdaabbc")

But this would match the third occurrence of b+ in abcdaabbcbb.  But
in this example I'm only interested in b+ if it's preceded by a+.

  Titus



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Gabor Grothendieck


Try this zero width negative look behind expression:

> gregexpr("(?!a+)(b+)", "abcdaabbc", perl = TRUE)
[[1]]
[1] 2 7
attr(,"match.length")
[1] 1 2

See ?regexp for more info.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Titus von der Malsburg
On Mon, Sep 27, 2010 at 7:29 PM, Gabor Grothendieck
<ggrothendi...@gmail.com> wrote:
> Try this zero width negative look behind expression:
>
> > gregexpr("(?!a+)(b+)", "abcdaabbc", perl = TRUE)
> [[1]]
> [1] 2 7
> attr(,"match.length")
> [1] 1 2

Thanks Gabor, but this gives me the same result as

  gregexpr("b+", "abcdaabbc", perl = TRUE)

which is wrong if the string is "abcdaabbcbbb".

  Titus



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Henrique Dallazuanna
You could do this:

gregexpr("ab+", "abcdaabbcbb")[[1]] + 1





-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40 S 49° 16' 22 O



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Henrique Dallazuanna
You've tried:

gregexpr("b+", "abcdaabbc")






-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40 S 49° 16' 22 O



Re: [R] Regular expressions: offsets of groups

2010-09-27 Thread Gabor Grothendieck
On Mon, Sep 27, 2010 at 1:34 PM, Titus von der Malsburg
<malsb...@gmail.com> wrote:
> On Mon, Sep 27, 2010 at 7:29 PM, Gabor Grothendieck
> <ggrothendi...@gmail.com> wrote:
>> Try this zero width negative look behind expression:
>>
>> > gregexpr("(?!a+)(b+)", "abcdaabbc", perl = TRUE)
>> [[1]]
>> [1] 2 7
>> attr(,"match.length")
>> [1] 1 2
>
> Thanks Gabor, but this gives me the same result as
>
>  gregexpr("b+", "abcdaabbc", perl = TRUE)
>
> which is wrong if the string is "abcdaabbcbbb".


Sorry, try this:

> gregexpr("(?<=a)b+", "abcdaabbcbbb", perl = TRUE)
[[1]]
[1] 2 7
attr(,"match.length")
[1] 1 2

Note that it does not give the same answer as:

> gregexpr("b+", "abcdaabbcbbb", perl = TRUE)
[[1]]
[1]  2  7 10
attr(,"match.length")
[1] 1 2 3






-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
