Re: [R] Need a vectorized way to avoid two nested FOR loops
Here is one way of doing it: n - 20 set.seed(2) # create test dataframe x - as.data.frame(matrix(sample(1:2,n*6, TRUE), nrow=n)) x V1 V2 V3 V4 V5 V6 1 1 2 2 2 1 1 2 2 1 1 2 2 1 3 2 2 1 2 1 2 4 1 1 1 1 1 2 5 2 1 2 2 1 1 6 2 1 2 1 2 2 7 1 1 2 1 2 2 8 2 1 1 1 1 1 9 1 2 2 1 2 1 10 2 1 2 1 1 1 11 2 1 1 1 2 1 12 1 1 1 1 1 2 13 2 2 2 1 1 1 14 1 2 2 1 2 2 15 1 2 1 1 1 2 16 2 2 2 2 1 2 17 2 2 2 1 1 2 18 1 1 2 2 1 1 19 1 2 2 1 1 2 20 1 1 2 2 1 2 x.col - c(1,3,5) # find matching columns by testing the first against all others x.match - x[, x.col[1]] == x[, x.col[-1]] # print them out x[apply(x.match, 1, all),] V1 V2 V3 V4 V5 V6 4 1 1 1 1 1 2 6 2 1 2 1 2 2 12 1 1 1 1 1 2 15 1 2 1 1 1 2 On Wed, Oct 7, 2009 at 3:52 PM, Rama Ramakrishnan r...@alum.mit.edu wrote: Hi Friends, I have a data frame d. Let vars be the column indices for a subset of the columns in d (e.g., vars - c(1,3,4,8)) For each row r in d, I want to collect all the other rows in d that match the values in row r for just the columns in vars. The naive way to do this is to have a for loop stepping through each row in d, and within the loop have another loop going through all the rows again, checking for equality. This is quadratic in the number of rows and takes way too long. Is there a better, vectorized way to do this? Thanks in advance! Rama Ramakrishnan __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Need a vectorized way to avoid two nested FOR loops
I answered the wrong question. Here is the code to find all the matches for each row: n - 20 set.seed(2) # create test dataframe x - as.data.frame(matrix(sample(1:2,n*6, TRUE), nrow=n)) x x.col - c(1,3,5) # match against all the other rows x.match1 - apply(x[, x.col], 1, function(a){ .mat - which(apply(x[, x.col], 1, function(z){ all(a == z) })) }) # remove matches to itself x.match2 - lapply(seq(length(x.match1)), function(z){ x.match1[[z]][!(x.match1[[z]] %in% z)] }) # x.match2 contains which rows indices match On Wed, Oct 7, 2009 at 3:52 PM, Rama Ramakrishnan r...@alum.mit.edu wrote: Hi Friends, I have a data frame d. Let vars be the column indices for a subset of the columns in d (e.g., vars - c(1,3,4,8)) For each row r in d, I want to collect all the other rows in d that match the values in row r for just the columns in vars. The naive way to do this is to have a for loop stepping through each row in d, and within the loop have another loop going through all the rows again, checking for equality. This is quadratic in the number of rows and takes way too long. Is there a better, vectorized way to do this? Thanks in advance! Rama Ramakrishnan __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Need a vectorized way to avoid two nested FOR loops
Another approach is: n - 20 set.seed(2) x - as.data.frame(matrix(sample(1:2, n*6, TRUE), nrow = n)) x.col - c(1, 3, 5) values - do.call(paste, c(x[x.col], sep = \r)) out - lapply(seq_along(ind), function (i) { ind - which(values == values[i]) ind[!ind %in% i] }) out Best, Dimitris joris meys wrote: Neat piece of code, Jim, but it still uses a nested loop. If you order the matrix first, you only need one passage through the whole matrix to find the information you need. Off course I don't take into account the ordering. If the ordering algorithm doesn't work in linear time, then it doesn't really matter I guess. The limiting step would become the ordering algorithm. Kind regards Joris On Thu, Oct 8, 2009 at 2:24 PM, jim holtman jholt...@gmail.com wrote: I answered the wrong question. Here is the code to find all the matches for each row: n - 20 set.seed(2) # create test dataframe x - as.data.frame(matrix(sample(1:2,n*6, TRUE), nrow=n)) x x.col - c(1,3,5) # match against all the other rows x.match1 - apply(x[, x.col], 1, function(a){ .mat - which(apply(x[, x.col], 1, function(z){ all(a == z) })) }) # remove matches to itself x.match2 - lapply(seq(length(x.match1)), function(z){ x.match1[[z]][!(x.match1[[z]] %in% z)] }) # x.match2 contains which rows indices match On Wed, Oct 7, 2009 at 3:52 PM, Rama Ramakrishnan r...@alum.mit.edu wrote: Hi Friends, I have a data frame d. Let vars be the column indices for a subset of the columns in d (e.g., vars - c(1,3,4,8)) For each row r in d, I want to collect all the other rows in d that match the values in row r for just the columns in vars. The naive way to do this is to have a for loop stepping through each row in d, and within the loop have another loop going through all the rows again, checking for equality. This is quadratic in the number of rows and takes way too long. Is there a better, vectorized way to do this? Thanks in advance! Rama Ramakrishnan __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Dimitris Rizopoulos Assistant Professor Department of Biostatistics Erasmus University Medical Center Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands Tel: +31/(0)10/7043478 Fax: +31/(0)10/7043014 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Need a vectorized way to avoid two nested FOR loops
If I understand your intent, I believe you can get what you want much faster (no interpreted loops and linear times) by looking at this slightly differently. First of all, the choice of columns is unimportant, as indexing can be used to create a data frame containing only the columns of interest. So I think you can abstract your request to: group the rows of a data frame so that all rows in a group match. Now the problem here is exactly what you mean by match. If the data are numeric, finite precision arithmetic requires one to ask whether you mean **exactly equal** or just equal within a tolerance. I shall assume the former, but the latter is often what one wants. It is a little more difficult to handle, but one way to do it with the present approach is to first round to a few digits that represent the tolerance and then proceed with the rounded values. As always (and as recommended by the posting guide !) a small reproducible example is helpful: ## Create a data frame with groups of identical rows. z - data.frame(matrix(rnorm(60),ncol=3))[sample(20,50,repl=TRUE),] ## now create a factor column of id's in which identical columns ## have identical id's (a hash) id - factor(do.call(paste,c(z,sep=+))) ## The levels of the factors now index groups of rows that match ## They can be easily accessed in a variety of way, e.g. as.numeric(id) ## gives all rows of each group of matching rows ## the same integer index. etc. This all requires only linear time. Hope this helps -- or my apologies if I have misinterpreted what was requested. Bert Gunter Genentech Nonclinical Biostatistics -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Dimitris Rizopoulos Sent: Thursday, October 08, 2009 6:28 AM To: joris meys Cc: r-help@r-project.org; Rama Ramakrishnan Subject: Re: [R] Need a vectorized way to avoid two nested FOR loops Another approach is: n - 20 set.seed(2) x - as.data.frame(matrix(sample(1:2, n*6, TRUE), nrow = n)) x.col - c(1, 3, 5) values - do.call(paste, c(x[x.col], sep = \r)) out - lapply(seq_along(ind), function (i) { ind - which(values == values[i]) ind[!ind %in% i] }) out Best, Dimitris joris meys wrote: Neat piece of code, Jim, but it still uses a nested loop. If you order the matrix first, you only need one passage through the whole matrix to find the information you need. Off course I don't take into account the ordering. If the ordering algorithm doesn't work in linear time, then it doesn't really matter I guess. The limiting step would become the ordering algorithm. Kind regards Joris On Thu, Oct 8, 2009 at 2:24 PM, jim holtman jholt...@gmail.com wrote: I answered the wrong question. Here is the code to find all the matches for each row: n - 20 set.seed(2) # create test dataframe x - as.data.frame(matrix(sample(1:2,n*6, TRUE), nrow=n)) x x.col - c(1,3,5) # match against all the other rows x.match1 - apply(x[, x.col], 1, function(a){ .mat - which(apply(x[, x.col], 1, function(z){ all(a == z) })) }) # remove matches to itself x.match2 - lapply(seq(length(x.match1)), function(z){ x.match1[[z]][!(x.match1[[z]] %in% z)] }) # x.match2 contains which rows indices match On Wed, Oct 7, 2009 at 3:52 PM, Rama Ramakrishnan r...@alum.mit.edu wrote: Hi Friends, I have a data frame d. Let vars be the column indices for a subset of the columns in d (e.g., vars - c(1,3,4,8)) For each row r in d, I want to collect all the other rows in d that match the values in row r for just the columns in vars. The naive way to do this is to have a for loop stepping through each row in d, and within the loop have another loop going through all the rows again, checking for equality. This is quadratic in the number of rows and takes way too long. Is there a better, vectorized way to do this? Thanks in advance! Rama Ramakrishnan __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Dimitris Rizopoulos Assistant Professor Department of Biostatistics Erasmus University Medical Center
Re: [R] Need a vectorized way to avoid two nested FOR loops
Bert, Jim, Dimitris and Joris, Thank you all very much for your prompt help and suggestions. After trying the ideas out, I have decided to go with Bert's approach since it is by far the fastest of the lot. Thanks again! Rama Ramakrishnan On Oct 8, 2009, at 12:49 PM, Bert Gunter wrote: If I understand your intent, I believe you can get what you want much faster (no interpreted loops and linear times) by looking at this slightly differently. First of all, the choice of columns is unimportant, as indexing can be used to create a data frame containing only the columns of interest. So I think you can abstract your request to: group the rows of a data frame so that all rows in a group match. Now the problem here is exactly what you mean by match. If the data are numeric, finite precision arithmetic requires one to ask whether you mean **exactly equal** or just equal within a tolerance. I shall assume the former, but the latter is often what one wants. It is a little more difficult to handle, but one way to do it with the present approach is to first round to a few digits that represent the tolerance and then proceed with the rounded values. As always (and as recommended by the posting guide !) a small reproducible example is helpful: ## Create a data frame with groups of identical rows. z - data.frame(matrix(rnorm(60),ncol=3))[sample(20,50,repl=TRUE),] ## now create a factor column of id's in which identical columns ## have identical id's (a hash) id - factor(do.call(paste,c(z,sep=+))) ## The levels of the factors now index groups of rows that match ## They can be easily accessed in a variety of way, e.g. as.numeric(id) ## gives all rows of each group of matching rows ## the same integer index. etc. This all requires only linear time. Hope this helps -- or my apologies if I have misinterpreted what was requested. Bert Gunter Genentech Nonclinical Biostatistics -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r- project.org] On Behalf Of Dimitris Rizopoulos Sent: Thursday, October 08, 2009 6:28 AM To: joris meys Cc: r-help@r-project.org; Rama Ramakrishnan Subject: Re: [R] Need a vectorized way to avoid two nested FOR loops Another approach is: n - 20 set.seed(2) x - as.data.frame(matrix(sample(1:2, n*6, TRUE), nrow = n)) x.col - c(1, 3, 5) values - do.call(paste, c(x[x.col], sep = \r)) out - lapply(seq_along(ind), function (i) { ind - which(values == values[i]) ind[!ind %in% i] }) out Best, Dimitris joris meys wrote: Neat piece of code, Jim, but it still uses a nested loop. If you order the matrix first, you only need one passage through the whole matrix to find the information you need. Off course I don't take into account the ordering. If the ordering algorithm doesn't work in linear time, then it doesn't really matter I guess. The limiting step would become the ordering algorithm. Kind regards Joris On Thu, Oct 8, 2009 at 2:24 PM, jim holtman jholt...@gmail.com wrote: I answered the wrong question. Here is the code to find all the matches for each row: n - 20 set.seed(2) # create test dataframe x - as.data.frame(matrix(sample(1:2,n*6, TRUE), nrow=n)) x x.col - c(1,3,5) # match against all the other rows x.match1 - apply(x[, x.col], 1, function(a){ .mat - which(apply(x[, x.col], 1, function(z){ all(a == z) })) }) # remove matches to itself x.match2 - lapply(seq(length(x.match1)), function(z){ x.match1[[z]][!(x.match1[[z]] %in% z)] }) # x.match2 contains which rows indices match On Wed, Oct 7, 2009 at 3:52 PM, Rama Ramakrishnan r...@alum.mit.edu wrote: Hi Friends, I have a data frame d. Let vars be the column indices for a subset of the columns in d (e.g., vars - c(1,3,4,8)) For each row r in d, I want to collect all the other rows in d that match the values in row r for just the columns in vars. The naive way to do this is to have a for loop stepping through each row in d, and within the loop have another loop going through all the rows again, checking for equality. This is quadratic in the number of rows and takes way too long. Is there a better, vectorized way to do this? Thanks in advance! Rama Ramakrishnan __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list
Re: [R] Need a vectorized way to avoid two nested FOR loops
Bert Gunter wrote: If I understand your intent, I believe you can get what you want much faster (no interpreted loops and linear times) by looking at this slightly differently. First of all, the choice of columns is unimportant, as indexing can be used to create a data frame containing only the columns of interest. So I think you can abstract your request to: group the rows of a data frame so that all rows in a group match. Now the problem here is exactly what you mean by match. If the data are numeric, finite precision arithmetic requires one to ask whether you mean **exactly equal** or just equal within a tolerance. I shall assume the former, but the latter is often what one wants. It is a little more difficult to handle, but one way to do it with the present approach is to first round to a few digits that represent the tolerance and then proceed with the rounded values. As always (and as recommended by the posting guide !) a small reproducible example is helpful: ## Create a data frame with groups of identical rows. z - data.frame(matrix(rnorm(60),ncol=3))[sample(20,50,repl=TRUE),] ## now create a factor column of id's in which identical columns ## have identical id's (a hash) id - factor(do.call(paste,c(z,sep=+))) ## The levels of the factors now index groups of rows that match ## They can be easily accessed in a variety of way, e.g. as.numeric(id) well, if this is indeed the case, then it is not even required to convert it to a factor, i.e., vals - do.call(paste,c(z,sep=+)) match(vals, sort(unique(vals))) # or even match(vals, unique(vals)) which I think is more efficient. However, I'm not sure if this is what was in fact requested. Best, Dimitris ## gives all rows of each group of matching rows ## the same integer index. etc. This all requires only linear time. Hope this helps -- or my apologies if I have misinterpreted what was requested. Bert Gunter Genentech Nonclinical Biostatistics -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Dimitris Rizopoulos Sent: Thursday, October 08, 2009 6:28 AM To: joris meys Cc: r-help@r-project.org; Rama Ramakrishnan Subject: Re: [R] Need a vectorized way to avoid two nested FOR loops Another approach is: n - 20 set.seed(2) x - as.data.frame(matrix(sample(1:2, n*6, TRUE), nrow = n)) x.col - c(1, 3, 5) values - do.call(paste, c(x[x.col], sep = \r)) out - lapply(seq_along(ind), function (i) { ind - which(values == values[i]) ind[!ind %in% i] }) out Best, Dimitris joris meys wrote: Neat piece of code, Jim, but it still uses a nested loop. If you order the matrix first, you only need one passage through the whole matrix to find the information you need. Off course I don't take into account the ordering. If the ordering algorithm doesn't work in linear time, then it doesn't really matter I guess. The limiting step would become the ordering algorithm. Kind regards Joris On Thu, Oct 8, 2009 at 2:24 PM, jim holtman jholt...@gmail.com wrote: I answered the wrong question. Here is the code to find all the matches for each row: n - 20 set.seed(2) # create test dataframe x - as.data.frame(matrix(sample(1:2,n*6, TRUE), nrow=n)) x x.col - c(1,3,5) # match against all the other rows x.match1 - apply(x[, x.col], 1, function(a){ .mat - which(apply(x[, x.col], 1, function(z){ all(a == z) })) }) # remove matches to itself x.match2 - lapply(seq(length(x.match1)), function(z){ x.match1[[z]][!(x.match1[[z]] %in% z)] }) # x.match2 contains which rows indices match On Wed, Oct 7, 2009 at 3:52 PM, Rama Ramakrishnan r...@alum.mit.edu wrote: Hi Friends, I have a data frame d. Let vars be the column indices for a subset of the columns in d (e.g., vars - c(1,3,4,8)) For each row r in d, I want to collect all the other rows in d that match the values in row r for just the columns in vars. The naive way to do this is to have a for loop stepping through each row in d, and within the loop have another loop going through all the rows again, checking for equality. This is quadratic in the number of rows and takes way too long. Is there a better, vectorized way to do this? Thanks in advance! Rama Ramakrishnan __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list
[R] Need a vectorized way to avoid two nested FOR loops
Hi Friends, I have a data frame d. Let vars be the column indices for a subset of the columns in d (e.g., vars - c(1,3,4,8)) For each row r in d, I want to collect all the other rows in d that match the values in row r for just the columns in vars. The naive way to do this is to have a for loop stepping through each row in d, and within the loop have another loop going through all the rows again, checking for equality. This is quadratic in the number of rows and takes way too long. Is there a better, vectorized way to do this? Thanks in advance! Rama Ramakrishnan __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.