Re: [R] Grep with wildcards across multiple columns

arun Fri, 15 Mar 2013 07:31:38 -0700

Hi,


You could try this for multiple intersect:

 dt[Reduce(function(...) intersect(...), 
list(grep(par.fund,fund),grep(par.func,func),grep(par.obj,obj))),sum(amount),by=c('code','year')]
#   code year     V1
#1: 1001 2011 123528
#2: 1001 2012  97362
#3: 1002 2011 103811
#4: 1002 2012  97179
 dt[intersect(intersect(grep(par.fund, fund), grep(par.func, func)),
                     grep(par.obj, obj)),
           sum(amount), by=c('code', 'year')]
 #  code year     V1
#1: 1001 2011 123528
#2: 1001 2012  97362
#3: 1002 2011 103811
#4: 1002 2012  97179
A.K.



----- Original Message -----
From: "Bush,  Daniel P.   DPI" <[email protected]>
To: "'[email protected]'" <[email protected]>
Cc: 'William Dunlap' <[email protected]>; "'[email protected]'" 
<[email protected]>; "'[email protected]'" <[email protected]>
Sent: Friday, March 15, 2013 10:06 AM
Subject: RE: Grep with wildcards across multiple columns

I think the way I set up my sample data without any explanation confused things 
slightly. These data might make things clearer:

# Create fake data
df <- data.frame(code   = c(rep(1001, 8), rep(1002, 8)),
                 year   = rep(c(rep(2011, 4), rep(2012, 4)), 2),
                 fund   = rep(c("10E", "27E", "27E", "29E"), 4),
                 func   = rep(c("110000", "122000", "214000", "158000"), 4),
                 obj    = rep(c("100", "100", "210", "220"), 4),
                 amount = round(rnorm(16, 50000, 10000)))

These are financial data with a hierarchical account structure where a zero 
represents a summary account that rolls up all the accounts at subsequent 
digits (e.g. 100000 rolls up 110000, 122000, 158000, etc.). I was trying to do 
two things with the search parameters: turn zeroes into question marks, and 
duplicate the functionality of a SQL query using those question marks as 
wildcards:

# Set parameters
par.fund <- "20E"; par.func <- "100000"; par.obj <- "000"
par.fund <- glob2rx(gsub("0", "?", par.fund))
par.func <- glob2rx(gsub("0", "?", par.func))
par.obj <- glob2rx(gsub("0", "?", par.obj))

Fortunately, Bill's suggestion to use the intersect function worked just 
fine--since intersect accepts only two arguments, I had to nest a pair of 
statements:

# Solution: Use a pair of nested intersects
dt2 <- dt[intersect(intersect(grep(par.fund, fund), grep(par.func, func)),
                    grep(par.obj, obj)),
          sum(amount), by=c('code', 'year')]
df2 <- ddply(df[intersect(intersect(grep(par.fund, df$fund),
                                    grep(par.func, df$func)),
                          grep(par.obj, df$obj)), ],
             .(code, year), summarize, amount = sum(amount))

Thanks for your ideas!

DB

Daniel Bush | School Finance Consultant 
School Financial Services | Wis. Dept. of Public Instruction 
daniel.bush -at- dpi.wi.gov | 608-267-9212

-----Original Message-----
From: William Dunlap [mailto:[email protected]] 
Sent: Thursday, March 14, 2013 5:49 PM
To: Bush, Daniel P. DPI; '[email protected]'
Subject: RE: Grep with wildcards across multiple columns

grep(pattern, textVector) returns of the integer indices of the elements of 
textVector that match the pattern.  E.g.,
  > grep("T", c("One","Two","Three","Four"))
  [1] 2 3

The '&' operator naturally operates on logical vectors of the same length (If 
you give it numbers it silently converts 0 to FALSE and  other numbers to TRUE.)

The two don't fit together.  You could use grepl(), which returns a logical 
vector the length of textVector, as in
   grepl(p1,v1) & grepl(p2,v2)
to figure which entries in the table have v1 matching p1 and v2 matching p2.

Or, you could use
  intersect(grep(p1,v1), grep(p2,v2))
if you want to stick with integer indices.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Bush, Daniel P. DPI
> Sent: Thursday, March 14, 2013 2:43 PM
> To: '[email protected]'
> Subject: [R] Grep with wildcards across multiple columns
> 
> I have a fairly large data set with six variables set up like the following 
> dummy:
> 
> # Create fake data
> df <- data.frame(code   = c(rep(1001, 8), rep(1002, 8)),
>                  year   = rep(c(rep(2011, 4), rep(2012, 4)), 2),
>                  fund   = rep(c("10E", "10E", "10E", "27E"), 4),
>                  func   = rep(c("110000", "122000", "214000", "158000"), 4),
>                  obj    = rep("100", 16),
>                  amount = round(rnorm(16, 50000, 10000)))
> 
> What I would like to do is sum the amount variable by code and year, 
> filtering rows using different wildcard searches in each of three 
> columns: "1?E" in fund, "1??????" in func, and "???" in obj. I'm OK turning 
> these into regular expressions:
> 
> # Set parameters
> par.fund <- "10E"; par.func <- "100000"; par.obj <- "000"
> par.fund <- glob2rx(gsub("0", "?", par.fund)) par.func <- 
> glob2rx(gsub("0", "?", par.func)) par.obj <- glob2rx(gsub("0", "?", 
> par.obj))
> 
> The problem occurs when I try to apply multiple greps across columns. 
> I'd prefer to use data.table since it's so much faster than plyr and I 
> have 159 different sets of parameters to run through, but I get the same 
> error setting it up either way:
> 
> # Doesn't work
> library(data.table)
> dt <- data.table(df)
> eval(parse(text=paste(
>   "dt2 <- dt[", "grep('", par.fund, "', fund) & ",
>   "grep('", par.func, "', func) & grep('", par.obj, "', obj)",
>   ", sum(amount), by=c('code', 'year')]" , sep=""))) # Warning 
> message:
> #   In grep("^1.E$", fund) & grep("^1.....$", func) :
> #   longer object length is not a multiple of shorter object length
> 
> # Also doesn't work
> library(plyr)
> eval(parse(text=paste(
>   "df2 <- ddply(df[grep('", par.fund, "', df$fund) & ",
>   "grep('", par.func, "', df$func) & grep('", par.obj, "', df$obj), ]",
>   ", .(code, year), summarize, amount = sum(amount))" , sep=""))) # 
> Warning message:
> #   In grep("^1.E$", df$fund) & grep("^1.....$", df$func) :
> #   longer object length is not a multiple of shorter object length
> 
> Clearly, the problem is how I'm trying to combine greps in subsetting 
> rows, but I haven't been able to find a solution that works. Any 
> thoughts-preferably something that works with data.table?
> 
> DB
> 
> Daniel Bush
> School Finance Consultant
> School Financial Services
> Wisconsin Department of Public Instruction PO Box 7841 | Madison, WI 
> 53707-7841 daniel.bush -at- dpi.wi.gov | sfs.dpi.wi.gov
> Ph: 608-267-9212 | Fax: 608-266-2840


______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Grep with wildcards across multiple columns

Reply via email to