Hi Bogdan,
Messy, and very specific to your problem:

df.sample.gene<-read.table(
 text="Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
525 chr2 223777758 223777758 T A exonic AP1S3
626 chr3 99794575 99794575 G A exonic COL8A1
643 chr3 132601066 132601066 A G exonic ACKR4
655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
 header=TRUE,stringsAsFactors=FALSE)
multgenes<-grep(",",df.sample.gene$Gene.refGene)
rep_genes<-strsplit(df.sample.gene$Gene.refGene[multgenes],",")
ngenes<-unlist(lapply(rep_genes,length))
dup_row<-function(x) {
 newrows<-x
 lastcol<-dim(x)[2]
 rep_genes<-unlist(strsplit(x[,lastcol],","))
 for(i in 2:length(rep_genes)) newrows<-rbind(newrows,x)
 newrows$Gene.refGene<-rep_genes
 return(newrows)
}
for(multgene in multgenes)
 df.sample.gene<-rbind(df.sample.gene,dup_row(df.sample.gene[multgene,]))
df.sample.gene<-df.sample.gene[-multgenes,]
df.sample.gene

I added a second line with multiple genes to make sure that it would work
with more than one line.

Jim

On Wed, Aug 23, 2017 at 9:57 AM, Bogdan Tanasa <tan...@gmail.com> wrote:
> I would appreciate please a suggestion on how to do the following :
>
> i'm working with a dataframe in R that contains in a specific column
> multiple gene names, eg :
>
>> df.sample.gene[15:20,2:8]
>      Chr     Start       End Ref Alt Func.refGene           Gene.refGene
> 284 chr2  16080996  16080996   C   T ncRNA_exonic                 GACAT3
> 448 chr2 113979920 113979920   C   T ncRNA_exonic LINC01191,LOC100499194
> 465 chr2 131279347 131279347   C   G ncRNA_exonic              LOC440910
> 525 chr2 223777758 223777758   T   A       exonic                  AP1S3
> 626 chr3  99794575  99794575   G   A       exonic                 COL8A1
> 643 chr3 132601066 132601066   A   G       exonic                  ACKR4
>
> How could I obtain a dataframe where each line that has multiple gene names
> (in the field Gene.refGene) is replicated with only one gene name ? i.e.
>
> for the second row :
>
> 448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
>
> we shall get in the final output (that contains all the rows) :
>
> 448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191
> 448 chr2 113979920 113979920 C T ncRNA_exonic LOC100499194
>
> thanks a lot !
>
> -- bogdan
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
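
For comparison, the same row expansion can probably be written without a loop. This is only a sketch in base R, not the method posted above: split_rows is a made-up helper name, the small data frame is a cut-down copy of the example data, and it assumes the gene column is character (stringsAsFactors=FALSE) with no empty strings. Unlike the loop version, it keeps the expanded rows in their original position instead of appending them at the end.

```r
# Hypothetical helper (not from the original post): turn each row whose
# `col` holds several `sep`-separated values into one row per value.
split_rows <- function(df, col = "Gene.refGene", sep = ",") {
  parts <- strsplit(df[[col]], sep, fixed = TRUE)   # list of gene vectors
  n <- lengths(parts)                               # genes per row
  # repeat each row index as many times as it has genes
  out <- df[rep(seq_len(nrow(df)), n), , drop = FALSE]
  out[[col]] <- unlist(parts)                       # one gene per row
  out
}

# Cut-down copy of the example data (three rows, three columns)
df <- data.frame(
  Chr = c("chr2", "chr2", "chr3"),
  Start = c(16080996, 113979920, 132601999),
  Gene.refGene = c("GACAT3", "LINC01191,LOC100499194", "BCDF5,CDFG6"),
  row.names = c("284", "448", "655"),
  stringsAsFactors = FALSE
)

res <- split_rows(df)
print(res)
```

Repeated rows get row names made unique by R (for example 448 and 448.1), so the original row label stays recognisable.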