Hey Zhuocheng,

I think Jim did a good job answering your first question (thanks Jim!). The answer to your second question is that the refSeq gene you mention is on chr6, for which we support a couple of haplotype chromosomes, so we annotate that gene as being on chr6 as well as on chr6_cox_hap1 and chr6_qbl_hap2. My awk script doesn't carry along the chrom addresses, so this data doesn't show up in its output, but it is in the original files.

Brian Raney

On Fri, Oct 30, 2009 at 6:53 AM, zhuocheng Hou <[email protected]> wrote:

> On Fri, Oct 30, 2009 at 12:52 AM, zhuocheng Hou <[email protected]> wrote:
>
> > Hi Everyone,
> >
> > I used the awk script provided by Brian (as follows) to concatenate
> > all the exon alignments into one file. I am not familiar with awk, so I
> > only copied the script and ran it on the sequence file directly, as
> > suggested. I found some strange things in the results.
> >
> > (1) I found lots of stop codons in the CDS sequences, e.g., NM_002099,
> > NM_2193; this is a widespread phenomenon in the exon alignment file.
> > I used the refGene.exonnuc.fa file.
> > (2) I don't know how the genome browser group generated the 44way refseq
> > exon alignment file. I found some duplicates in the sequence file, e.g.,
> > NM_001320.
> >
> > Can anyone explain a little about these two questions?
> >
> > Thanks,
> > Zhuocheng
> >
> > On Thu, Oct 29, 2009 at 5:34 PM, Brian Raney <[email protected]> wrote:
> >
> >> Hey Zhuocheng,
> >>
> >> There are a couple of ways you can get the full CDS for refSeq genes for
> >> all the species with aligning sequence in the 44way.
> >>
> >> If you have a small set of genes you're interested in, the easiest way
> >> would be to use the table browser. If you want the full set of genes
> >> represented in the refSeq set, then you can parse the download file by
> >> concatenating the exons. I'll describe both these methods below.
> >>
> >> First, the format of the entries in the CDS FASTA data set, and how to
> >> get them out of the table browser, is described here:
> >> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA
> >>
> >> If you're not familiar with using the Table Browser, you can read the
> >> tutorial here:
> >> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
> >>
> >> Secondly, if you want the whole CDS from the exon-only downloads, you can
> >> just concatenate all the exons for a particular gene together. I include
> >> an awk script below which does this (WARNING: awk script not validated by
> >> our QA dept. Use at your own risk).
> >>
> >> If this doesn't answer your question, feel free to write back to this
> >> list.
> >>
> >> Brian Raney
> >>
> >> ---
> >>
> >> To run script:
> >>
> >> $ zcat refGene.exonAA.fa.gz | awk -f awk.script
> >>
> >> where awk.script is a file with the following in it:
> >>
> >> />/ {
> >>     geneSpecies=$1; gsub("_[0-9]+_[0-9]+","",geneSpecies);
> >>     species=geneSpecies; gsub(".+_","", species);
> >>     speciesList[species]=1;
> >>     gene=geneSpecies; gsub("_" species,"",gene);
> >>     if (geneBuf[species] != gene)
> >>     {
> >>         if (geneBuf[species] != "")
> >>             print geneBuf[species] "_" species, size[species] "\n" sequence[species];
> >>         geneBuf[species]=gene; sequence[species]=""; size[species]=$2
> >>     }
> >>     else
> >>         {size[species] += $2}
> >> }
> >>
> >> /^[A-Z-]/ {sequence[species] = sequence[species] $1}
> >>
> >> END {for(ii in speciesList)
> >>     print geneBuf[ii] "_" ii, size[ii] "\n" sequence[ii];
> >> }
> >>
> >> On Thu, Oct 29, 2009 at 11:27 AM, zhuocheng Hou <[email protected]> wrote:
> >> >
> >> > Hi Everyone,
> >> >
> >> > I want to extract the CDS region from the 44way_refseq alignment file.
> >> > However, this alignment is based on exons. Can anyone give some
> >> > information about this file and how to link these exons into a full CDS?
> >> >
> >> > The sequence names look like this: NM_001077470_hg18_1_7. What is the
> >> > meaning of the _1_7?
> >> >
> >> > Thanks
> >> > Zhuocheng
> >> > _______________________________________________
> >> > Genome maillist - [email protected]
> >> > https://lists.soe.ucsc.edu/mailman/listinfo/genome
> >
> > --
> > Zhuocheng Hou, Ph.D.
> > PRB/NICHD/NIH
> > Wayne State University School of Medicine
> > 540 E. Canfield Avenue
> > Detroit, MI 48201
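
A note on the duplicates discussed in the reply at the top: the chrom addresses are still in the original FASTA headers, so one way to see which genes are annotated on more than one chromosome is to scan the headers directly. The following is only a rough Python sketch, not the awk script from the thread. It assumes that the last field of each header is a position such as chr6_cox_hap1:100-129+ and that species names contain no underscores; neither assumption is confirmed in this thread, and the function names are made up for illustration.

```python
from collections import defaultdict


def reference_chroms(lines, reference="hg18"):
    """Map each gene name to the set of chroms on which the reference
    assembly has a record for it.

    ASSUMPTION: headers look like
        >NM_001320_hg18_1_7 <size> <inFrame> <outFrame> chr6:100-129+
    i.e. the last field is chrom:start-end followed by the strand.
    """
    chroms = defaultdict(set)
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to record
        fields = line.split()
        # Split the name from the right: gene, species, exon number, exon count.
        parts = fields[0][1:].rsplit("_", 3)
        if len(parts) != 4 or parts[1] != reference:
            continue  # other species, or a header in an unexpected shape
        if len(fields) < 2 or ":" not in fields[-1]:
            continue  # no position on this header
        chroms[parts[0]].add(fields[-1].split(":", 1)[0])
    return chroms


def duplicated_genes(lines, reference="hg18"):
    """Return only the genes seen on more than one reference chrom."""
    found = reference_chroms(lines, reference)
    return {gene: sorted(c) for gene, c in found.items() if len(c) > 1}
```

To try it on the download file, one could open it with `gzip.open("refGene.exonAA.fa.gz", "rt")` and pass the file object to `duplicated_genes`. Genes that come back with entries such as chr6 and chr6_cox_hap1 are the ones that will appear more than once in the concatenated output.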
