Re: [R] Concatenating data frames with partial overlap in variable names

Daniel Folkinshteyn Sat, 24 Mar 2007 19:38:47 -0800

on 03/24/2007 11:13 PM Marc Schwartz said the following:
> On Sat, 2007-03-24 at 22:16 -0400, Daniel Folkinshteyn wrote:
>> on 03/24/2007 10:00 PM Marc Schwartz said the following:
>>> On Sat, 2007-03-24 at 21:47 -0400, Daniel Folkinshteyn wrote:
>>>> Greetings to all.
>>>> I need to concatenate data frames that do not all have the same variable
>>>> names; there is only a partial overlap in the variables. So, for
>>>> example, if I have two data frames, a and b, that look like the following:
>>>> > a
>>>>   a b
>>>> 1 1 4
>>>> 2 2 5
>>>> 3 3 6
>>>> 4 4 7
>>>> 5 5 8
>>>> > b
>>>>   c  a
>>>> 1 1 10
>>>> 2 2 11
>>>> 3 3 12
>>>> 4 4 13
>>>> 5 5 14
>>>>
>>>> I want to concatenate them by row, without any matching, so that the
>>>> variables that are not available in all frames get NAs. The result
>>>> should look like:
>>>>
>>>>    a  b  c
>>>> 1  1  4  NA
>>>> 2  2  5  NA
>>>> 3  3  6  NA
>>>> 4  4  7  NA
>>>> 5  5  8  NA
>>>> 6  10 NA 1
>>>> 7  11 NA 2
>>>> 8  12 NA 3
>>>> 9  13 NA 4
>>>> 10 14 NA 5
>>>>
>>>> rbind() doesn't work, since it requires all variables to be matched
>>>> between the two data frames. merge() doesn't work, since it wants to
>>>> /match/ by columns with the same name, and if matching by nothing,
>>>> it produces a Cartesian product.
>>>>
>>>> Is there a neat trick for doing this simply, or am I stuck with
>>>> comparing variable lists and generating NAs manually?
>>>>
>>>> I would appreciate any help!
>>>> Daniel
>>> You can use merge():
>>>
>>> > a
>>>   a b
>>> 1 1 4
>>> 2 2 5
>>> 3 3 6
>>> 4 4 7
>>> 5 5 8
>>>
>>> > b
>>>   c  a
>>> 1 1 10
>>> 2 2 11
>>> 3 3 12
>>> 4 4 13
>>> 5 5 14
>>>
>>>
>>> Use 'a' as the common 'by' column and specify 'all = TRUE' so that
>>> non-matching values of 'a' will be included in the result:
>>>
>>>
>>> > merge(a, b, by = "a", all = TRUE)
>>>     a  b  c
>>> 1   1  4 NA
>>> 2   2  5 NA
>>> 3   3  6 NA
>>> 4   4  7 NA
>>> 5   5  8 NA
>>> 6  10 NA  1
>>> 7  11 NA  2
>>> 8  12 NA  3
>>> 9  13 NA  4
>>> 10 14 NA  5
>>>
>> Thanks for your quick response. Unfortunately, this is still not quite
>> what I have in mind (though maybe it's my fault for not making this
>> clear enough). Even if the two data frames happen to have some values of 'a'
>> that match, I still want those records to remain separate, rather than
>> be merged. So, for instance, using merge will produce the following:
>> > a = data.frame(a=1:5, b=4:8)
>> > a
>>   a b
>> 1 1 4
>> 2 2 5
>> 3 3 6
>> 4 4 7
>> 5 5 8
>> > b = data.frame(c=1:5, a=4:8)
>> > b
>>   c a
>> 1 1 4
>> 2 2 5
>> 3 3 6
>> 4 4 7
>> 5 5 8
>> > merge(a, b, by='a', all=TRUE)
>>   a  b  c
>> 1 1  4 NA
>> 2 2  5 NA
>> 3 3  6 NA
>> 4 4  7  1
>> 5 5  8  2
>> 6 6 NA  3
>> 7 7 NA  4
>> 8 8 NA  5
>>
>> whereas I would still want it to produce 10 separate rows, because they
>> are separate observations; it's just that some of them happen to be
>> missing a variable.
> 
> OK. I'm not sure this is the most efficient way of doing it, but it
> seems to work, though I have tested it only lightly.
> 
> Basically what I am doing is using setdiff() to figure out which columns
> are not common between the two data frames. In each case, I then use
> sapply() to loop over the results, creating a new column of NA's that
> will be cbind()ed back to the original data frame.
> 
> Once that is done, the two new data frames, a.2 and b.2, will have
> common columns and they can then be rbind()ed.
> 
> 
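> # For each column found only in 'b', add an all-NA column to 'a'.
> a.2 <- cbind(a, sapply(setdiff(colnames(b), colnames(a)),
>                        function(x) rep(NA, nrow(a))))
> 
> # Likewise, add to 'b' the columns found only in 'a'.
> b.2 <- cbind(b, sapply(setdiff(colnames(a), colnames(b)),
>                        function(x) rep(NA, nrow(b))))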
> 
> > a.2
>   a b  c
> 1 1 4 NA
> 2 2 5 NA
> 3 3 6 NA
> 4 4 7 NA
> 5 5 8 NA
> 
> > b.2
>   c a  b
> 1 1 4 NA
> 2 2 5 NA
> 3 3 6 NA
> 4 4 7 NA
> 5 5 8 NA
> 
> 
> > rbind(a.2, b.2)
>    a  b  c
> 1  1  4 NA
> 2  2  5 NA
> 3  3  6 NA
> 4  4  7 NA
> 5  5  8 NA
> 6  4 NA  1
> 7  5 NA  2
> 8  6 NA  3
> 9  7 NA  4
> 10 8 NA  5
> 
> 
> Hopefully that will work in more general cases, but I would validate
> that.
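
The setdiff() steps quoted above can be wrapped into a small helper. This is
only a sketch under stated assumptions: the function name rbind_fill_na is
hypothetical (it does not appear in the thread), it handles exactly two data
frames, and it relies on rbind() matching data frame columns by name.

```r
# Hypothetical helper (name not from the thread): row-bind two data frames,
# giving each one an all-NA column for every variable it lacks.
rbind_fill_na <- function(x, y) {
  # Columns present in one data frame but not in the other.
  only_in_y <- setdiff(colnames(y), colnames(x))
  only_in_x <- setdiff(colnames(x), colnames(y))

  # Add the missing columns, filled with NA. The loops do nothing
  # when a setdiff() result is empty.
  for (nm in only_in_y) x[[nm]] <- rep(NA, nrow(x))
  for (nm in only_in_x) y[[nm]] <- rep(NA, nrow(y))

  # rbind() on data frames matches columns by name, so the differing
  # column order of x and y does not matter here.
  rbind(x, y)
}

# Same example data as in the thread.
a <- data.frame(a = 1:5, b = 4:8)
b <- data.frame(c = 1:5, a = 4:8)

result <- rbind_fill_na(a, b)
print(result)
```

With the example data this should give the same 10 rows as the rbind(a.2, b.2)
output quoted above, with overlapping values of 'a' kept as separate rows.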


Thanks Marc, that seems like a pretty elegant solution, and I have
learned some useful stuff from it.

However, I will go with rbind.fill() from the reshape package, which was
recommended by Hadley (thanks!), since it's just so darn easy. :)
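
A minimal sketch of the rbind.fill() call on the example data from this
thread. One assumption to flag: the thread refers to the reshape package as it
was in 2007, whereas this sketch loads plyr, which exports rbind.fill() in
more recent installations and must be installed for the code to run.

```r
# Assumption: the plyr package is installed; it exports rbind.fill().
library(plyr)

# Same example data as in the thread.
a <- data.frame(a = 1:5, b = 4:8)
b <- data.frame(c = 1:5, a = 4:8)

# Stack the rows of both data frames; any column missing from one of
# them is filled with NA for that data frame's rows.
combined <- rbind.fill(a, b)
print(combined)
```

No matching on 'a' takes place, so rows with overlapping values of 'a' are
kept as separate observations.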

Thank you all,
Daniel


______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
