Hey Zhuocheng,

There are a couple of ways you can get the full CDS for refSeq genes for all the species with aligning sequence in the 44way.
If you have a small set of genes you're interested in, the easiest way is to use the Table Browser. If you want the full set of genes represented in the refSeq set, you can parse the download file and concatenate the exons. I'll describe both of these methods below.

First, the format of the entries in the CDS FASTA data set, and how to get them out of the Table Browser, is described here:

http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA

If you're not familiar with using the Table Browser, you can read the tutorial here:

http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html

Second, if you want the whole CDS from the exon-only downloads, you can just concatenate all the exons for a particular gene. I include an awk script below that does this (WARNING: awk script not validated by our QA dept. Use at your own risk).

If this doesn't answer your question, feel free to write back to this list.

Brian Raney

---

To run the script:

    $ zcat refGene.exonAA.fa.gz | awk -f awk.script

where awk.script is a file with the following in it:

    # Header lines look like ">NM_001077470_hg18_1_7 ..."; $2 is treated as the exon size.
    /^>/ {
        # Drop the trailing "_N_M" suffix, leaving ">gene_species".
        geneSpecies = $1; gsub("_[0-9]+_[0-9]+$", "", geneSpecies)
        # The species is whatever follows the last underscore.
        species = geneSpecies; gsub(".+_", "", species)
        speciesList[species] = 1
        gene = geneSpecies; gsub("_" species "$", "", gene)
        if (geneBuf[species] != gene) {
            # New gene for this species: print the finished one, then reset.
            if (geneBuf[species] != "")
                print geneBuf[species] "_" species, size[species] "\n" sequence[species]
            geneBuf[species] = gene
            sequence[species] = ""
            size[species] = $2
        } else {
            size[species] += $2
        }
    }
    # Sequence lines: append to the buffer of the most recently seen species.
    /^[A-Z-]/ { sequence[species] = sequence[species] $1 }
    # Flush the last gene held for each species.
    END {
        for (ii in speciesList)
            print geneBuf[ii] "_" ii, size[ii] "\n" sequence[ii]
    }

On Thu, Oct 29, 2009 at 11:27 AM, zhuocheng Hou <[email protected]> wrote:
>
> Hi Everyone,
>
> I want to exact CDS region from the 44way_refseq alignment file. However,
> this alignment was based on the exon. Do anyone can give some information
> for this file about how to link these exons into full CDS?
>
> The sequence file like this: NM_001077470_hg18_1_7, what's the meaning of
> the _1_7?
>
> Thanks
> Zhuocheng
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
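
For readers who prefer Python to awk, the exon concatenation described in the reply above could be sketched roughly as follows. This is not the script from the reply, and it is equally unvalidated. It assumes each header's first field has the layout gene_species_exonNumber_exonCount (which would make the _1_7 in NM_001077470_hg18_1_7 "exon 1 of 7") and that the second field is the exon's length; check both assumptions against the FASTA format description linked in the reply. The names `HEADER_RE`, `concat_exons`, and `SAMPLE` are made up for illustration, and the sample records are invented, not taken from the real download.

```python
import re

# Assumed header layout: ">GENE_SPECIES_EXONNUM_EXONCOUNT LENGTH ..."
HEADER_RE = re.compile(
    r"^>(?P<gene>\S+)_(?P<species>[^_\s]+)_(?P<exon>\d+)_(?P<count>\d+)\s+(?P<length>\d+)"
)


def concat_exons(lines):
    """Yield (gene, species, total_length, sequence) for each gene/species pair.

    Like the awk script, this streams the input and assumes that all exons
    of a gene appear consecutively for a given species.
    """
    current = {}    # species -> [gene, total_length, list of sequence pieces]
    species = None  # species named by the most recent header line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            m = HEADER_RE.match(line)
            if m is None:
                raise ValueError(f"unexpected header: {line!r}")
            gene, species = m.group("gene"), m.group("species")
            length = int(m.group("length"))
            rec = current.get(species)
            if rec is not None and rec[0] == gene:
                rec[1] += length  # another exon of the same gene
            else:
                if rec is not None:
                    # The previous gene for this species is complete.
                    yield rec[0], species, rec[1], "".join(rec[2])
                current[species] = [gene, length, []]
        elif species is not None:
            current[species][2].append(line)
    # Flush the last gene still buffered for each species.
    for sp, rec in current.items():
        yield rec[0], sp, rec[1], "".join(rec[2])


# Invented sample records, only to show the input shape.
SAMPLE = """\
>NM_1_hg18_1_2 3 0 0 chr1:1-9+
MKT
>NM_1_panTro2_1_2 3 0 0 chr1:1-9+
MK-

>NM_1_hg18_2_2 2 0 0 chr1:20-25+
LS
>NM_1_panTro2_2_2 2 0 0 chr1:20-25+
LQ
>NM_2_hg18_1_1 1 0 0 chr2:1-3+
M
"""

for gene, sp, total, seq in concat_exons(SAMPLE.splitlines()):
    print(f">{gene}_{sp} {total}")
    print(seq)
```

To run this on the real download, the same function can be fed a file opened with `gzip.open("refGene.exonAA.fa.gz", "rt")` from the standard library instead of `SAMPLE.splitlines()`.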
