I'm not seeing stop codons in NM_002099. Did you remember to reverse complement, since it's on the negative strand? There are some cases (42) where there are in-frame stop codons because of selenocysteine, but they're rare, and NM_002099 is not one of them.
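For anyone who wants to check this themselves, here is a minimal Python sketch of the test described above: scan a concatenated CDS for in-frame stop codons as given, then again after reverse complementing. The helper names and the toy sequence are made up for illustration and are not part of any UCSC tool or file; whether a given download is already on the coding strand is something to verify against the data.

```python
# Sketch only: helper names and the example sequence are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement; gaps ('-') and N pass through unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def internal_stop_positions(cds):
    """Return 0-based codon indices of in-frame stops, ignoring the final codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


if __name__ == "__main__":
    # Toy minus-strand gene, written as it would appear on the plus strand.
    as_given = "TTATTTTAGTAACAT"
    print("as given:          ", internal_stop_positions(as_given))
    print("reverse complement:", internal_stop_positions(reverse_complement(as_given)))
```

If the stops disappear after reverse complementing, the sequence was being read on the wrong strand. Stops that remain are worth checking against the selenocysteine cases before assuming a problem with the alignment.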

On Oct 30, 2009, at 9:53 AM, zhuocheng Hou wrote:

> On Fri, Oct 30, 2009 at 12:52 AM, zhuocheng Hou <[email protected]> wrote:
>
>> Hi Everyone,
>>
>> I used the awk script provided by Brian (below) to concatenate all the
>> exon alignments into one file. I am not familiar with awk, so I just
>> copied the script and ran it on the sequence file directly, as
>> suggested. I found some strange things in the results.
>>
>> (1) I found lots of stop codons in the CDS sequences, e.g., NM_002099
>> and NM_2193. This seems to be widespread in the exon alignment file.
>> I used the refGene.exonnuc.fa file.
>> (2) I don't know how the genome browser group generated the 44way
>> refseq exon alignment file. I found some duplicates in the sequence
>> file, e.g., NM_001320.
>>
>> Can anyone explain a little about these two questions?
>>
>> Thanks,
>> Zhuocheng
>>
>> On Thu, Oct 29, 2009 at 5:34 PM, Brian Raney <[email protected]> wrote:
>>
>>> Hey Zhuocheng,
>>>
>>> There are a couple of ways you can get the full CDS for refSeq genes
>>> for all the species with aligning sequence in the 44way.
>>>
>>> If you have a small set of genes you're interested in, the easiest
>>> way would be to use the table browser. If you want the full set of
>>> genes represented in the refSeq set, then you can parse the download
>>> file by concatenating the exons. I'll describe both these methods
>>> below.
>>>
>>> First, the format of the entries in the CDS FASTA data set, and how
>>> to get them out of the table browser, is described here:
>>> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA
>>>
>>> If you're not familiar with using the Table Browser, you can read
>>> the tutorial here:
>>> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
>>>
>>> Secondly, if you want the whole CDS from the exon-only downloads, you
>>> can just concatenate all the exons for a particular gene together. I
>>> include an awk script below which does this (WARNING: awk script not
>>> validated by our QA dept. Use at your own risk).
>>>
>>> If this doesn't answer your question, feel free to write back to
>>> this list.
>>>
>>> Brian Raney
>>>
>>> ---
>>>
>>> To run script:
>>>
>>> $ zcat refGene.exonAA.fa.gz | awk -f awk.script
>>>
>>> where awk.script is a file with the following in it:
>>>
>>> />/ {
>>>     geneSpecies=$1; gsub("_[0-9]+_[0-9]+","",geneSpecies);
>>>     species=geneSpecies; gsub(".+_","", species);
>>>     speciesList[species]=1;
>>>     gene=geneSpecies; gsub("_" species,"",gene);
>>>     if (geneBuf[species] != gene)
>>>     {
>>>         if (geneBuf[species] != "")
>>>             print geneBuf[species] "_" species, size[species] "\n" sequence[species];
>>>         geneBuf[species]=gene; sequence[species]=""; size[species]=$2
>>>     }
>>>     else
>>>         {size[species] += $2}
>>> }
>>>
>>> /^[A-Z-]/ {sequence[species] = sequence[species] $1}
>>>
>>> END {for(ii in speciesList)
>>>     print geneBuf[ii] "_" ii, size[ii] "\n" sequence[ii];
>>> }
>>>
>>> On Thu, Oct 29, 2009 at 11:27 AM, zhuocheng Hou <[email protected]> wrote:
>>>>
>>>> Hi Everyone,
>>>>
>>>> I want to extract the CDS region from the 44way_refseq alignment
>>>> file. However, this alignment is based on exons. Can anyone give
>>>> some information about how to link these exons into a full CDS?
>>>>
>>>> The sequence names look like this: NM_001077470_hg18_1_7. What is
>>>> the meaning of the _1_7?
>>>>
>>>> Thanks
>>>> Zhuocheng
>>>> _______________________________________________
>>>> Genome maillist - [email protected]
>>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>
>> --
>> Zhuocheng Hou, Ph.D.
>> PRB/NICHD/NIH
>> Wayne State University School of Medicine
>> 540 E. Canfield Avenue
>> Detroit, MI 48201
