Thanks Bill! Works great! Thanks again guys!
On Fri, Aug 10, 2012 at 2:43 PM, William Dunlap <[email protected]> wrote:
> If you think about this as a runs problem you can get a loopless solution
> that I think is easier to read (once the requisite functions are defined).
>
> First define the function to canonicalize the name
>   nickname <- function(x) sub(" .*", "", x)
> then define some handy runs functions
>   isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
>   isJustBefore <- function(x) c(x[-1], FALSE) # x should be logical
> then use those functions on your dataset
>   > nearDup <- !isFirstInRun(nickname(d$NAME)) & isFirstInRun(d$ID)
>   > d[ nearDup | isJustBefore(nearDup), ]
>     ID             NAME YEAR      SOURCE
>   1  1    New York Mets 1900        ESPN
>   2  2 New York Yankees 1920 Cooperstown
> See how it works with triplicates as well
>   > dd <- rbind(d, data.frame(ID=6:8,
>        NAME=c("Chicago Blacksox", "Chicago Cubs",
>               "Chicago Whitesox"),
>        YEAR=1701:1703, SOURCE=rep("made up", 3)))
>   > nearDup <- !isFirstInRun(nickname(dd$NAME)) & isFirstInRun(dd$ID)
>   > dd[ nearDup | isJustBefore(nearDup), ]
>     ID             NAME YEAR      SOURCE
>   1  1    New York Mets 1900        ESPN
>   2  2 New York Yankees 1920 Cooperstown
>   6  6 Chicago Blacksox 1701     made up
>   7  7     Chicago Cubs 1702     made up
>   8  8 Chicago Whitesox 1703     made up
>
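To see why the runs approach picks up both rows of a pair, it may help to trace the intermediate logical vectors on a tiny input. This is only a sketch: the three helpers are copied from the message above, while the vectors nm and id are made up for illustration, and the ID column is used for the "differs from the previous row" test.

```r
# Helpers as defined in the message above
nickname     <- function(x) sub(" .*", "", x)
isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
isJustBefore <- function(x) c(x[-1], FALSE)   # x should be logical

# Toy input (hypothetical, for illustration only)
nm <- c("New York Mets", "New York Yankees", "Boston Redsox")
id <- c(1, 2, 3)

nick    <- nickname(nm)                       # "New" "New" "Boston"
first   <- isFirstInRun(nick)                 # TRUE FALSE TRUE
# A row is a near-duplicate when it continues a run of nicknames
# but its ID differs from the ID of the row before it.
nearDup <- !first & isFirstInRun(id)          # FALSE TRUE FALSE
# isJustBefore() shifts the flags up one row, so the row that
# starts each flagged run is kept as well.
keep    <- nearDup | isJustBefore(nearDup)    # TRUE TRUE FALSE

print(data.frame(nm, nick, first, nearDup, keep))
```

The final OR is what turns "second and later members of a run" into "every member of a run that has at least one near-duplicate".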
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of Rui Barradas
> > Sent: Friday, August 10, 2012 11:18 AM
> > To: Fred G
> > Cc: r-help
> > Subject: Re: [R] Regular Expressions + Matrices
> >
> > Hello,
> >
> > Try the following.
> >
> >
> > d <- read.table(textConnection("
> > ID NAME YEAR SOURCE
> > 1 'New York Mets' 1900 ESPN
> > 2 'New York Yankees' 1920 Cooperstown
> > 3 'Boston Redsox' 1918 ESPN
> > 4 'Washington Nationals' 2010 ESPN
> > 5 'Detroit Tigers' 1990 ESPN
> > "), header=TRUE)
> >
> > d$NAME <- as.character(d$NAME)
> >
> > fun <- function(i, x){
> >     if(x[i, "ID"] != x[i + 1, "ID"]){
> >         s <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]
> >         if(grepl(s, x[i + 1, "NAME"])) return(TRUE)
> >     }
> >     FALSE
> > }
> >
> > inx <- sapply(seq_len(nrow(d) - 1), fun, d)
> > inx <- c(inx, FALSE) | c(FALSE, inx)
> > d[inx, ]
> >
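One thing to keep in mind with the loop above: grepl(s, ...) looks for the first word anywhere in the next name and treats it as a regular expression, so "New" would also match a name such as "Brand New Club". A possible vectorized variant that compares the first words for exact equality is sketched below. The function name adjacentNameMatch and the data frame d2 are made up for this example; the ID and NAME columns are assumed to be laid out as in the sample data.

```r
# Hypothetical helper: flag adjacent rows whose first word is identical
# while their IDs differ. Assumes columns named ID and NAME.
adjacentNameMatch <- function(x) {
    n <- nrow(x)
    if (n < 2) return(rep(FALSE, n))
    first <- sub("[[:space:]].*$", "", x$NAME)   # first word of each name
    # Compare row i with row i + 1 for i = 1, ..., n - 1
    hit <- first[-n] == first[-1] & x$ID[-n] != x$ID[-1]
    # Keep both members of every matching pair
    c(hit, FALSE) | c(FALSE, hit)
}

# Made-up data with a name that contains "New" but does not start with it
d2 <- data.frame(ID = 1:4,
                 NAME = c("New York Mets", "New York Yankees",
                          "Brand New Club", "Boston Redsox"),
                 stringsAsFactors = FALSE)

res <- d2[adjacentNameMatch(d2), ]
print(res)   # rows 1 and 2 only
```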
> > Hope this helps,
> >
> > Rui Barradas
> > On 10-08-2012 18:41, Fred G wrote:
> > > Hi all,
> > >
> > > My code looks like the following:
> > > inname = read.csv("ID_error_checker.csv", as.is=TRUE)
> > > outname = read.csv("output.csv", as.is=TRUE)
> > >
> > > #My algorithm is the following:
> > > #for line in inname
> > > #if first string up to whitespace in row in inname$name = first string
> > > #  up to whitespace in row + 1 in inname$name
> > > #AND ID in inname$ID for the top row NOT EQUAL ID in inname$ID for the
> > > #  row below it
> > > #copy these two lines to a new file
> > >
> > > In other words, if the name (up to the first whitespace) in the first
> > > row equals the name in the second row (and so on for the whole file)
> > > and the ID in the first row does not equal the ID in the second row,
> > > copy both of these rows in full to a new file. The only caveat is that
> > > I want a regular expression that takes not the full names, but just the
> > > first string up to the first whitespace in the inname$name column (i.e.
> > > if row1 has a name of New York Mets and row2 has a name of New York
> > > Yankees, I would want both of these rows to be copied in full since
> > > "New" is the same in both...)
> > >
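The algorithm described above can be followed almost line by line with a plain loop. The sketch below is one possible reading of it, not a definitive implementation: it uses temporary files in place of "ID_error_checker.csv" and the new file so that it runs on its own, and it assumes the columns are named ID and NAME as in the sample data.

```r
# Temporary stand-ins for the input file and the new output file
infile  <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")
writeLines(c("ID,NAME,YEAR,SOURCE",
             "1,New York Mets,1900,ESPN",
             "2,New York Yankees,1920,Cooperstown",
             "3,Boston Redsox,1918,ESPN",
             "4,Washington Nationals,2010,ESPN",
             "5,Detroit Tigers,1990,ESPN"), infile)

inname <- read.csv(infile, as.is = TRUE)

# Regular expression: keep only the first string up to the first whitespace
firstWord <- sub("^([^[:space:]]+).*$", "\\1", inname$NAME)

# For each row, compare it with the row below it
keep <- logical(nrow(inname))
for (i in seq_len(max(nrow(inname) - 1, 0))) {
    if (firstWord[i] == firstWord[i + 1] &&
        inname$ID[i] != inname$ID[i + 1]) {
        keep[c(i, i + 1)] <- TRUE      # copy both rows of the pair
    }
}

# Write the selected rows, in full, to the new file
write.csv(inname[keep, ], outfile, row.names = FALSE)
result <- read.csv(outfile, as.is = TRUE)
print(result)
```

For large files a vectorized comparison would likely be faster, but the loop keeps a one-to-one correspondence with the steps listed in the question.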
> > > Here is some example data:
> > > ID  NAME                  YEAR  SOURCE       NOTES
> > > 1   New York Mets         1900  ESPN
> > > 2   New York Yankees      1920  Cooperstown
> > > 3   Boston Redsox         1918  ESPN
> > > 4   Washington Nationals  2010  ESPN
> > > 5   Detroit Tigers        1990  ESPN
> > >
> > > The desired output would be:
> > > ID  NAME                  YEAR  SOURCE
> > > 1   New York Mets         1900  ESPN
> > > 2   New York Yankees      1920  Cooperstown
> > >
> > > Thanks so much!
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > [email protected] mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
>