Run this:

> p <- c('a', 'c', '', ''); a <- c(10, 20, 30, 40)
> d1 <- data.frame(Promoter=p, ip=a)   # Note the duplicate empty names in p.
> p <- c('b', 'c', 'd', ''); a <- c(15, 20, 30, 40)
> d2 <- data.frame(Promoter=p, ip=a)
> all <- merge(x=d2, y=d1, by="Promoter", all=TRUE)
> all <- merge(x=all, y=d1, by="Promoter", all=TRUE)
> all
The data are:

> d1
  Promoter ip
1        a 10
2        c 20
3          30
4          40

> d2
  Promoter ip
1        b 15
2        c 20
3        d 30
4          40

The output looks like this:

  Promoter ip.x ip.y ip
1            40   30 30
2            40   40 30
3            40   30 40
4            40   40 40
5        b   15   NA NA
6        c   20   20 20
7        d   30   NA NA
8        a   NA   10 10

The odd thing, in my view, is that each instance of '' seems to be treated as unique, so that with each successive merge every combination of blank rows is generated, much like a Cartesian product (a SQL cross join). Non-empty names, by contrast, appear to be matched one-to-one, as in an ordinary join.

With genomic data (10^4 data points) it is easy to have a couple of blanks buried in the middle of things, and to combine several replicates with successive merges. I couldn't understand why my three replicates of 6000 points, in which I expected substantial overlap in the labels, were taking so long to merge and ultimately generating 57000 labels. The culprit turned out to be a few hundred blanks buried in the middle.

Why does the empty ("null") name merit special treatment? Perhaps I'm missing something. I hesitate to submit this as a bug, since technically I guess you could say that blank names, especially duplicates, are not kosher. On the other hand, this combinatorial behaviour seems to occur only for blanks.

-Frank

PhD, Computational Biologist, Harvard Medical School
BCMP/SGM-322, 250 Longwood Ave, Boston MA 02115, USA.
Tel: 617-432-3555 Fax: 617-432-3557
http://llama.med.harvard.edu/~fgibbons
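
P.S. In case it helps anyone chasing the same symptom, here is a minimal sketch of a pre-merge check: it lists the keys that occur more than once in a data frame and estimates how many rows a merge with all=TRUE should return. The helper names repeated_keys and predicted_rows are made up for illustration (they are not part of base R), and the sketch assumes a single key column with no NA values. It merges d2 with d1 and then with d1 again, which is the order that yields the eight rows shown above.

```r
# Same toy data as in the post.
d1 <- data.frame(Promoter = c('a', 'c', '', ''), ip = c(10, 20, 30, 40))
d2 <- data.frame(Promoter = c('b', 'c', 'd', ''), ip = c(15, 20, 30, 40))

# Hypothetical helper: keys that occur more than once in df (blank or not).
repeated_keys <- function(df, by) {
  k <- as.character(df[[by]])
  unique(k[duplicated(k)])
}

# Hypothetical helper: number of rows merge(x, y, by = by, all = TRUE)
# should return, assuming one key column and no NA keys.
predicted_rows <- function(x, y, by) {
  kx <- as.character(x[[by]])
  ky <- as.character(y[[by]])
  keys <- union(kx, ky)
  nx <- vapply(keys, function(k) sum(kx == k), integer(1), USE.NAMES = FALSE)
  ny <- vapply(keys, function(k) sum(ky == k), integer(1), USE.NAMES = FALSE)
  # A key present on both sides yields nx * ny rows; a key present on one
  # side only yields its own count, because all = TRUE keeps unmatched rows.
  sum(pmax(nx, 1L) * pmax(ny, 1L))
}

repeated_keys(d1, "Promoter")        # the blank name, ""
predicted_rows(d2, d1, "Promoter")   # 6

m1 <- merge(x = d2, y = d1, by = "Promoter", all = TRUE)
predicted_rows(m1, d1, "Promoter")   # 8, the row count of the output above
```

Running repeated_keys on each replicate before merging would have flagged the blanks in my real data straight away.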