Very tidy. Amazing what is hidden away in R packages.

Jim

On Sat, Aug 26, 2017 at 5:26 AM, Jeff Newmiller
<jdnew...@dcn.davis.ca.us> wrote:
> If row numbers can be dispensed with, then tidyr makes this easy with the
> unnest function:
>
> #####
> library(dplyr)
> #>
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #>
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #>
> #>     intersect, setdiff, setequal, union
> library(purrr)
> library(tidyr)
>
> df.sample.gene<-read.table(
>   text="Chr Start End Ref Alt Func.refGene Gene.refGene
> 284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
> 448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
> 465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
> 525 chr2 223777758 223777758 T A exonic AP1S3
> 626 chr3 99794575 99794575 G A exonic COL8A1
> 643 chr3 132601066 132601066 A G exonic ACKR4
> 655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
>   header=TRUE,stringsAsFactors=FALSE)
>
> df.sample.out <- (   df.sample.gene
>                  %>% mutate( Gene.refGene = strsplit( Gene.refGene
>                                                     , ","
>                                                     )
>                            )
>                  %>% unnest( Gene.refGene )
>                  )
> df.sample.out
> #>    Chr     Start       End Ref Alt Func.refGene Gene.refGene
> #> 1 chr2  16080996  16080996   C   T ncRNA_exonic       GACAT3
> #> 2 chr2 113979920 113979920   C   T ncRNA_exonic    LINC01191
> #> 3 chr2 113979920 113979920   C   T ncRNA_exonic LOC100499194
> #> 4 chr2 131279347 131279347   C   G ncRNA_exonic    LOC440910
> #> 5 chr2 223777758 223777758   T   A       exonic        AP1S3
> #> 6 chr3  99794575  99794575   G   A       exonic       COL8A1
> #> 7 chr3 132601066 132601066   A   G       exonic        ACKR4
> #> 8 chr3 132601999 132601999   A   G       exonic        BCDF5
> #> 9 chr3 132601999 132601999   A   G       exonic        CDFG6
> #####
>
>
> On Wed, 23 Aug 2017, Jim Lemon wrote:
>
>> Hi Bogdan,
>> Messy, and very specific to your problem:
>>
>> df.sample.gene<-read.table(
>>  text="Chr Start End Ref Alt Func.refGene Gene.refGene
>> 284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
>> 448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
>> 465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
>> 525 chr2 223777758 223777758 T A exonic AP1S3
>> 626 chr3 99794575 99794575 G A exonic COL8A1
>> 643 chr3 132601066 132601066 A G exonic ACKR4
>> 655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
>>  header=TRUE,stringsAsFactors=FALSE)
>>
>> multgenes<-grep(",",df.sample.gene$Gene.refGene)
>> rep_genes<-strsplit(df.sample.gene$Gene.refGene[multgenes],",")
>> ngenes<-unlist(lapply(rep_genes,length))
>> dup_row<-function(x) {
>>  newrows<-x
>>  lastcol<-dim(x)[2]
>>  rep_genes<-unlist(strsplit(x[,lastcol],","))
>>  for(i in 2:length(rep_genes)) newrows<-rbind(newrows,x)
>>  newrows$Gene.refGene<-rep_genes
>>  return(newrows)
>> }
>> for(multgene in multgenes)
>>  df.sample.gene<-rbind(df.sample.gene,dup_row(df.sample.gene[multgene,]))
>> df.sample.gene<-df.sample.gene[-multgenes,]
>> df.sample.gene
>>
>> I added a second line with multiple genes to make sure that it would
>> work with more than one line.
>>
>> Jim
>>
>>
>> On Wed, Aug 23, 2017 at 9:57 AM, Bogdan Tanasa <tan...@gmail.com> wrote:
>>>
>>> I would appreciate please a suggestion on how to do the following :
>>>
>>> i'm working with a dataframe in R that contains in a specific column
>>> multiple gene names, eg :
>>>
>>>> df.sample.gene[15:20,2:8]
>>>
>>>      Chr     Start       End Ref Alt Func.refGene           Gene.refGene
>>> 284 chr2  16080996  16080996   C   T ncRNA_exonic                 GACAT3
>>> 448 chr2 113979920 113979920   C   T ncRNA_exonic LINC01191,LOC100499194
>>> 465 chr2 131279347 131279347   C   G ncRNA_exonic              LOC440910
>>> 525 chr2 223777758 223777758   T   A       exonic                  AP1S3
>>> 626 chr3  99794575  99794575   G   A       exonic                 COL8A1
>>> 643 chr3 132601066 132601066   A   G       exonic                  ACKR4
>>>
>>> How could I obtain a dataframe where each line that has multiple gene
>>> names
>>> (in the field Gene.refGene) is replicated with only one gene name ? i.e.
>>>
>>> for the second row :
>>>
>>> 448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
>>>
>>> we shall get in the final output (that contains all the rows) :
>>>
>>> 448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191
>>> 448 chr2 113979920 113979920 C T ncRNA_exonic LOC100499194
>>>
>>> thanks a lot !
>>>
>>> -- bogdan
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
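
For comparison with the two approaches in this thread, here is a minimal
base-R sketch of the same row replication, needing no extra packages. It is
not code posted by anyone in the thread: the function name split_gene_rows
and its column/sep arguments are made up for illustration, and it assumes
the gene column is character (stringsAsFactors=FALSE) with no empty entries.

```r
# Base-R sketch: one output row per comma-separated gene name.
# split_gene_rows, column and sep are hypothetical names, not from the thread.
split_gene_rows <- function(df, column = "Gene.refGene", sep = ",") {
  # Split every entry into its individual gene names (a list of vectors).
  genes <- strsplit(df[[column]], sep, fixed = TRUE)
  # Repeat each row index once per gene name found in that row.
  idx <- rep(seq_len(nrow(df)), lengths(genes))
  out <- df[idx, , drop = FALSE]
  # Replace the combined names with the single names, in the same order.
  out[[column]] <- unlist(genes)
  out
}

# Shortened copy of the sample data used earlier in the thread.
df.sample.gene <- read.table(
  text = "Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910",
  header = TRUE, stringsAsFactors = FALSE)

df.sample.out <- split_gene_rows(df.sample.gene)
df.sample.out
```

Because the rows are selected by a repeated index, the original row names
are kept and made unique (448 and 448.1 for the split row), which may help
when the row numbers still matter. Output rows stay in the original order
rather than being appended at the end.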