On Fri, Oct 30, 2009 at 12:52 AM, zhuocheng Hou <[email protected]> wrote:
> Hi Everyone,
>
> I used the awk script provided by Brian (quoted below) to concatenate all
> the exon alignments into one file. I am not familiar with awk, so I simply
> copied the script and ran it on the sequence file as suggested. I found
> some strange things in the results.
>
> (1) I found lots of stop codons in the CDS sequences, e.g., NM_002099 and
> NM_2193. This is widespread in the exon alignment file. I used the
> refGene.exonnuc.fa file.
> (2) I don't know how the genome browser group generates the 44way refseq
> exon alignment file. I found some duplicates in the sequence file, e.g.,
> NM_001320.
>
> Can anyone explain a little about these two questions?
>
> Thanks,
> Zhuocheng
>
> On Thu, Oct 29, 2009 at 5:34 PM, Brian Raney <[email protected]> wrote:
>
>> Hey Zhuocheng,
>>
>> There are a couple of ways you can get the full CDS for refSeq genes for
>> all the species with aligning sequence in the 44way.
>>
>> If you have a small set of genes you're interested in, the easiest way
>> would be to use the table browser. If you want the full set of genes
>> represented in the refSeq set, then you can parse the download file by
>> concatenating the exons. I'll describe both these methods below.
>>
>> First, the format of the entries in the CDS FASTA data set, and how to get
>> them out of the table browser, is described here:
>> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA
>>
>> If you're not familiar with using the Table Browser, you can read the
>> tutorial here:
>> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
>>
>> Secondly, if you want the whole CDS from the exon-only downloads you can
>> just concatenate all the exons for a particular gene together. I include an
>> awk script below which does this (WARNING: awk script not validated by our
>> QA dept. Use at your own risk).
>>
>> If this doesn't answer your question, feel free to write back to this
>> list.
>>
>> Brian Raney
>>
>> ---
>>
>> To run script:
>>
>> $ zcat refGene.exonAA.fa.gz | awk -f awk.script
>>
>> where awk.script is a file with the following in it:
>>
>> # Header line, e.g. >NM_001077470_hg18_1_7 followed by the exon length
>> />/ {
>>     # strip the trailing _exonNumber_exonCount to get gene_species
>>     geneSpecies=$1; gsub("_[0-9]+_[0-9]+","",geneSpecies);
>>     species=geneSpecies; gsub(".+_","", species);
>>     speciesList[species]=1;
>>     gene=geneSpecies; gsub("_" species,"",gene);
>>     if (geneBuf[species] != gene)
>>     {
>>         # a new gene starts for this species: print the finished one
>>         if (geneBuf[species] != "")
>>             print geneBuf[species] "_" species, size[species] "\n" sequence[species];
>>         geneBuf[species]=gene; sequence[species]=""; size[species]=$2
>>     }
>>     else
>>         {size[species] += $2}
>> }
>>
>> # Sequence line: append to the current species' buffer
>> /^[A-Z-]/ {sequence[species] = sequence[species] $1}
>>
>> # Print the last gene buffered for each species
>> END {for(ii in speciesList)
>>     print geneBuf[ii] "_" ii, size[ii] "\n" sequence[ii];
>> }
>>
>> On Thu, Oct 29, 2009 at 11:27 AM, zhuocheng Hou <[email protected]> wrote:
>> >
>> > Hi Everyone,
>> >
>> > I want to extract the CDS region from the 44way_refseq alignment file.
>> > However, this alignment is based on exons. Can anyone give some
>> > information about this file and how to link these exons into the full
>> > CDS?
>> >
>> > The sequence names look like this: NM_001077470_hg18_1_7. What is the
>> > meaning of the _1_7?
>> >
>> > Thanks
>> > Zhuocheng
>> > _______________________________________________
>> > Genome maillist - [email protected]
>> > https://lists.soe.ucsc.edu/mailman/listinfo/genome
>
> --
> Zhuocheng Hou, Ph.D.
> PRB/NICHD/NIH
> Wayne State University School of Medicine
> 540 E. Canfield Avenue
> Detroit, MI 48201
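
For readers who are not comfortable with awk, the concatenation that the quoted script performs can be sketched in Python. This is only an illustration, not a UCSC-supported tool. It assumes, as the awk script does, that each header looks like >GENE_DB_EXONNUMBER_EXONCOUNT followed by the exon length as the second field (so the _1_7 in NM_001077470_hg18_1_7 would be read as exon 1 of 7). The function names and the example records are made up for the demonstration.

```python
import re

# Assumed header layout, inferred from the awk script and the example name in
# this thread: >GENE_DB_EXONNUMBER_EXONCOUNT LENGTH ...
HEADER_RE = re.compile(r"^>(?P<gene>.+)_(?P<db>[^_]+)_(?P<exon>\d+)_(?P<count>\d+)$")


def _emit(db, record):
    """Turn an internal [gene, length, chunks] record into an output tuple."""
    gene, length, chunks = record
    return (gene + "_" + db, length, "".join(chunks))


def concatenate_exons(lines):
    """Return (name, total_length, sequence) tuples, one per run of exons.

    Like the awk script, a new record starts whenever the gene name changes
    for a given assembly, so a gene that appears twice in the file yields
    two separate records instead of being merged.
    """
    finished = []
    current = {}  # db -> [gene, total_length, list of sequence chunks]
    db = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line.split()
            match = HEADER_RE.match(fields[0])
            if match is None:
                raise ValueError("unexpected header: " + line)
            gene, db = match["gene"], match["db"]
            length = int(fields[1])
            record = current.get(db)
            if record is not None and record[0] == gene:
                record[1] += length  # another exon of the same gene
            else:
                if record is not None:
                    finished.append(_emit(db, record))
                current[db] = [gene, length, []]
        elif db is not None:
            current[db][2].append(line)
    # Flush the last record held for each assembly.
    finished.extend(_emit(d, r) for d, r in current.items())
    return finished


# Made-up records for illustration only (not real alignment data).
example = """\
>NM_000001_hg18_1_2 3 0 0
ATG
>NM_000001_panTro2_1_2 3 0 0
ATG
>NM_000001_hg18_2_2 3 0 0
TAA
>NM_000001_panTro2_2_2 3 0 0
---
""".splitlines()

for name, length, seq in concatenate_exons(example):
    print(">" + name, length)
    print(seq)
```

Because records are split on a change of gene name rather than merged by name, an accession that occurs more than once in the input (as reported for NM_001320) shows up as more than one output record, which makes such duplicates easy to spot.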
