On Nov 13, 2011, at 1:19 AM, Joshua Wiley wrote:
Hi Carlos,
Here is one option:
## read in your data
dat <- read.table(textConnection("
obs unit home z sex age
1 015029 18 1 1 053
2 015029 18 1 2 049
3 015029 01 1 1 038
4 015029 01 1 2 033
5 015029 02 1 1 036
6 015029 02 1 2 033
7 015029 03 1 1 023
8 015029 03 1 2 019
9 015029 04 1 2 045
10 015029 05 1 2 047"),
header = TRUE, stringsAsFactors = FALSE)
closeAllConnections()
## create a unique ID for matching unit and home
I would have used the 'interaction' function. And I would have read
the data in with colClasses to preserve the labeling of the household
information:
> dat <- read.table(textConnection("
+ obs unit home z sex age
+ 1 015029 18 1 1 053
+ 2 015029 18 1 2 049
+ 3 015029 01 1 1 038
+ 4 015029 01 1 2 033
+ 5 015029 02 1 1 036
+ 6 015029 02 1 2 033
+ 7 015029 03 1 1 023
+ 8 015029 03 1 2 019
+ 9 015029 04 1 2 045
+ 10 015029 05 1 2 047"),
+ header = TRUE, stringsAsFactors = FALSE, colClasses=c(rep("factor",
4), "numeric"))
> closeAllConnections()
>
> ## create a unique ID for matching unit and home
> dat
obs unit home z sex age
1 1 015029 18 1 1 053
2 2 015029 18 1 2 049
3 3 015029 01 1 1 038
4 4 015029 01 1 2 033
5 5 015029 02 1 1 036
6 6 015029 02 1 2 033
7 7 015029 03 1 1 023
8 8 015029 03 1 2 019
9 9 015029 04 1 2 045
10 10 015029 05 1 2 047
> dat$together <- with(dat, interaction(unit, home) )
> dat$togeth.n <- as.numeric(dat$together)
> dat
obs unit home z sex age together togeth.n
1 1 015029 18 1 1 053 015029.18 6
2 2 015029 18 1 2 049 015029.18 6
3 3 015029 01 1 1 038 015029.01 1
4 4 015029 01 1 2 033 015029.01 1
5 5 015029 02 1 1 036 015029.02 2
6 6 015029 02 1 2 033 015029.02 2
7 7 015029 03 1 1 023 015029.03 3
8 8 015029 03 1 2 019 015029.03 3
9 9 015029 04 1 2 045 015029.04 4
10 10 015029 05 1 2 047 015029.05 5
I'm assuming that the numbering of the unit/household pairings is
somewhat arbitrary. The I would set to missing all of the non-couple
households:
> is.na(dat$togett.n) <- !as.logical( ave(dat$sex, dat$together,
FUN=function(x) 1 %in% x & 2 %in% x) )
> dat
obs unit home z sex age together togett.n
1 1 015029 18 1 1 053 015029.18 6
2 2 015029 18 1 2 049 015029.18 6
3 3 015029 01 1 1 038 015029.01 1
4 4 015029 01 1 2 033 015029.01 1
5 5 015029 02 1 1 036 015029.02 2
6 6 015029 02 1 2 033 015029.02 2
7 7 015029 03 1 1 023 015029.03 3
8 8 015029 03 1 2 019 015029.03 3
9 9 015029 04 1 2 045 015029.04 NA
10 10 015029 05 1 2 047 015029.05 NA
I actually think it would be better to use the fully labeled
"together" variable rather than the numeric version. It retains its
data heritage better.
--
David.
dat$mID <- with(dat, paste(unit, home, sep = ''))
## somewhat messy way of creating a couple number
## for each mID, if there is more than 1 row, and more than 1 sex
## it creates a couple id, otherwise 0
i <- 0L
dat$couple <- with(dat, unlist(lapply(split(sex, mID), function(x) {
i <<- i + 1L
if (length(x) > 1 && length(unique(x)) > 1) {
rep(i, length(x))
} else 0L
})))
## view results
dat
obs unit home z sex age mID couple
1 1 15029 18 1 1 53 1502918 1
2 2 15029 18 1 2 49 1502918 1
3 3 15029 1 1 1 38 150291 2
4 4 15029 1 1 2 33 150291 2
5 5 15029 2 1 1 36 150292 3
6 6 15029 2 1 2 33 150292 3
7 7 15029 3 1 1 23 150293 4
8 8 15029 3 1 2 19 150293 4
9 9 15029 4 1 2 45 150294 0
10 10 15029 5 1 2 47 150295 0
See these functions for more details:
?ave # where I got my idea
?split
?lapply
?`<<-`
Cheers,
Josh
On Sat, Nov 12, 2011 at 8:16 PM, jour4life <[email protected]>
wrote:
Hi all,
I've searched everywhere to try to find out how to do this and have
had no
luck. I am trying to construct identifiers for couples in a dataset.
Essentially, I want to identify couples using more than one column as
identifiers. Take for instance:
obs unit home z sex age
1 015029 18 1 1 053
2 015029 18 1 2 049
3 015029 01 1 1 038
4 015029 01 1 2 033
5 015029 02 1 1 036
6 015029 02 1 2 033
7 015029 03 1 1 023
8 015029 03 1 2 019
9 015029 04 1 2 045
10 015029 05 1 2 047
Where unit is the housing unit, home is household. Of course, there
are more
values for unit, although these first ten observations consist of
the same
unit (which could possibly be an apartment complex). Nonetheless, I
want to
construct an identifier for couples if unit, home match, but only
if both
male and female are within the same household. Taking the example
data
above, I want to see this:
unit home z sex age couple
1 015029 18 1 1 053 1
2 015029 18 1 2 049 1
3 015029 01 1 1 038 2
4 015029 01 1 2 033 2
5 015029 02 1 1 036 3
6 015029 02 1 2 033 3
7 015029 03 1 1 023 4
8 015029 03 1 2 019 4
9 015029 04 1 2 045 0
10 015029 05 1 2 047 0
As you can see in the last two observations, there were no males
identified
within the same household, thus the last two observations would not
contain
couple identifiers, rather some other identifier (but the same one)
so I can
detect them and remove them later. I've tried using the duplicated
function
but was not very useful.
Any help would be greatly appreciated!!!
Thanks,
Carlos
--
View this message in context:
http://r.789695.n4.nabble.com/identify-duplicate-from-more-than-one-column-tp4035888p4035888.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.