Your question mystifies me, since it looks to me like you already know the answer.
-- 
Sent from my phone. Please excuse my brevity.
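For reference, a minimal sketch of the count asked about in the quoted message below. It reuses the sample data and the err2 step from that message, with DF$first passed directly as the grouping variable; the names n_before and n_after are introduced here only for illustration.

```r
# Sample data copied from the quoted message
DF <- read.table( text =
'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE )

# For every row, count the distinct last names recorded for that first name
err2 <- ave( seq_along( DF$first )
           , DF$first
           , FUN = function( n ) length( unique( DF[ n, "last" ] ) )
           )

# Keep only the people with exactly one last name
result2 <- DF[ 1 == err2, ]

# Number of unique first names in each data set
n_before <- length( unique( DF$first ) )       # 3 (Alex, Bob, Cory)
n_after  <- length( unique( result2$first ) )  # 2 (Bob, Cory)
```

The same length(unique(...)) idiom works on any of the filtered results in the thread, since they all keep the first column.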
On February 12, 2017 3:30:49 PM PST, Val <valkr...@gmail.com> wrote:
>Hi Jeff and all,
> How do I get the number of unique first names in the two data sets?
>
>for the first one,
>result2 <- DF[ 1 == err2, ]
>length(unique(result2$first))
>
>On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
><jdnew...@dcn.davis.ca.us> wrote:
>> The "by" function aggregates and returns a result with generally fewer
>> rows than the original data. Since you are looking to index the rows in
>> the original data set, the "ave" function is better suited because it
>> always returns a vector that is just as long as the input vector:
>>
>> # I usually work with character data rather than factors if I plan
>> # to modify the data (e.g. removing rows)
>> DF <- read.table( text=
>> 'first week last
>> Alex 1 West
>> Bob 1 John
>> Cory 1 Jack
>> Cory 2 Jack
>> Bob 2 John
>> Bob 3 John
>> Alex 2 Joseph
>> Alex 3 West
>> Alex 4 West
>> ', header = TRUE, as.is = TRUE )
>>
>> err <- ave( DF$last
>>           , DF[ , "first", drop = FALSE]
>>           , FUN = function( lst ) {
>>               length( unique( lst ) )
>>             }
>>           )
>> result <- DF[ "1" == err, ]
>> result
>>
>> Notice that the ave function returns a vector of the same type as was
>> given to it, so even though the function returns a numeric the err
>> vector is character.
>>
>> If you wanted to be able to examine more than one other column in
>> determining the keep/reject decision, you could do:
>>
>> err2 <- ave( seq_along( DF$first )
>>            , DF[ , "first", drop = FALSE]
>>            , FUN = function( n ) {
>>                length( unique( DF[ n, "last" ] ) )
>>              }
>>            )
>> result2 <- DF[ 1 == err2, ]
>> result2
>>
>> and then you would have the option to re-use the "n" index to look at
>> other columns as well.
>>
>> Finally, here is a dplyr solution:
>>
>> library(dplyr)
>> result3 <- ( DF
>>            %>% group_by( first ) # like a prep for ave or by
>>            %>% mutate( err = length( unique( last ) ) ) # similar to ave
>>            %>% filter( 1 == err ) # drop the rows with too many last names
>>            %>% select( -err ) # drop the temporary column
>>            %>% as.data.frame # convert back to a plain-jane data frame
>>            )
>> result3
>>
>> which uses a small set of verbs in a pipeline of functions to go from
>> input to result in one pass.
>>
>> If your data set is really big (running out of memory big) then you
>> might want to investigate the data.table or sqlite packages, either of
>> which can be combined with dplyr to get a standardized syntax for
>> managing larger amounts of data. However, most people actually aren't
>> running out of memory so in most cases the extra horsepower isn't
>> actually needed.
>>
>>
>> On Sun, 12 Feb 2017, P Tennant wrote:
>>
>>> Hi Val,
>>>
>>> The by() function could be used here. With the dataframe dfr:
>>>
>>> # split the data by first name and check for more than one last name
>>> # for each first name
>>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>>> # make the result more easily manipulated
>>> res <- as.table(res)
>>> res
>>> # first
>>> #  Alex   Bob  Cory
>>> #  TRUE FALSE FALSE
>>>
>>> # then use this result to subset the data
>>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>>> # sort if needed
>>> nw.dfr[order(nw.dfr$first) , ]
>>>
>>>   first week last
>>> 2   Bob    1 John
>>> 5   Bob    2 John
>>> 6   Bob    3 John
>>> 3  Cory    1 Jack
>>> 4  Cory    2 Jack
>>>
>>>
>>> Philip
>>>
>>> On 12/02/2017 4:02 PM, Val wrote:
>>>>
>>>> Hi all,
>>>> I have a big data set and want to remove rows conditionally.
>>>> In my data file each person were recorded for several weeks. Somehow
>>>> during the recording periods, their last name was misreported. For
>>>> each person, the last name should be the same. Otherwise remove from
>>>> the data.
>>>> Example, in the following data set, Alex was found to have
>>>> two last names .
>>>>
>>>> Alex West
>>>> Alex Joseph
>>>>
>>>> Alex should be removed from the data. if this happens then I want
>>>> remove all rows with Alex. Here is my data set
>>>>
>>>> df<- read.table(header=TRUE, text='first week last
>>>> Alex 1 West
>>>> Bob 1 John
>>>> Cory 1 Jack
>>>> Cory 2 Jack
>>>> Bob 2 John
>>>> Bob 3 John
>>>> Alex 2 Joseph
>>>> Alex 3 West
>>>> Alex 4 West ')
>>>>
>>>> Desired output
>>>>
>>>>   first week last
>>>> 1   Bob    1 John
>>>> 2   Bob    2 John
>>>> 3   Bob    3 John
>>>> 4  Cory    1 Jack
>>>> 5  Cory    2 Jack
>>>>
>>>> Thank you in advance
>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------