Hey Zhuocheng,

There are a couple of ways you can get the full CDS for refSeq genes for all
the species with aligning sequence in the 44way.

If you have a small set of genes you're interested in, the easiest way is
to use the Table Browser.  If you want the full set of genes represented
in the refSeq set, you can parse the download file and concatenate the
exons yourself.  I'll describe both methods below.

First, the format of the entries in the CDS FASTA data set, and how to get
them out of the Table Browser, is described here:
http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA

If you're not familiar with using the Table Browser, you can read the
tutorial here:
http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html

Secondly, if you want the whole CDS from the exon-only downloads, you can
just concatenate all the exons for a particular gene.  I've included an
awk script below that does this (WARNING: this awk script has not been
validated by our QA dept. Use at your own risk).

If this doesn't answer your question, feel free to write back to this list.

Brian Raney

---

To run the script:

$ zcat refGene.exonAA.fa.gz | awk -f awk.script

where awk.script is a file with the following in it:

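# Header line, e.g. ">NM_001077470_hg18_1_7 <size> ..."
/^>/ {
    # Strip the trailing "_<number>_<number>" to leave ">gene_species".
    geneSpecies = $1; gsub("_[0-9]+_[0-9]+$", "", geneSpecies);
    # The species is whatever follows the last underscore.
    species = geneSpecies; gsub(".+_", "", species);
    speciesList[species] = 1;
    # The gene is what remains once "_species" is removed.
    gene = geneSpecies; gsub("_" species "$", "", gene);
    if (geneBuf[species] != gene)
        {
        # New gene for this species: print the finished one, then reset.
        if (geneBuf[species] != "")
            print geneBuf[species] "_" species,
                  size[species] "\n" sequence[species];
        geneBuf[species] = gene; sequence[species] = ""; size[species] = $2
        }
    else
        {size[species] += $2}    # another exon of the same gene
}

# Sequence line: append it to the current species' buffer.
/^[A-Z-]/ {sequence[species] = sequence[species] $1}

# Flush the last gene buffered for each species.
END {for (ii in speciesList)
        print geneBuf[ii] "_" ii, size[ii] "\n" sequence[ii];
    }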
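
On the _1_7 question: as far as I can tell from the FASTA help page linked
above, the names follow the pattern name_db_exonNumber_exonCount, so
NM_001077470_hg18_1_7 would be exon 1 of 7 for NM_001077470 in hg18. The
sketch below just pulls a header apart along those lines. The function name
parse_header and the size value 33 are made up for illustration, and it
assumes the species name itself contains no underscore.

```shell
# Split a header such as ">NM_001077470_hg18_1_7 33" into its parts.
# parse_header is a hypothetical helper, not part of any UCSC tool.
parse_header() {
    printf '%s\n' "$1" | awk '{
        name = substr($1, 2)                  # drop the leading ">"
        n = split(name, part, "_")            # last three pieces: db, exon, count
        species = part[n-2]
        gene = name
        sub("_" species "_[0-9]+_[0-9]+$", "", gene)
        # prints: gene species exonNumber exonCount secondField
        print gene, species, part[n-1], part[n], $2
    }'
}

parse_header ">NM_001077470_hg18_1_7 33"
# prints: NM_001077470 hg18 1 7 33
```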


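If you want to see what the concatenated output should look like before
running anything on the full download, here is a small self-contained check.
It is a simplified variant, not the script above: it keeps every sequence in
memory keyed by ">gene_species" rather than streaming, and the input is toy
data invented for the example (toy_input and concat_exons are hypothetical
names). Like the script above, it treats the second header field as the
exon's size and sums it.

```shell
# Toy input: one made-up gene with two exons in two species (not real data).
toy_input() {
cat <<'EOF'
>NM_000001_hg18_1_2 3 0 0
MAK
>NM_000001_panTro2_1_2 3 0 0
MAR
>NM_000001_hg18_2_2 2 0 0
DE
>NM_000001_panTro2_2_2 2 0 0
DQ
EOF
}

# Concatenate the exons of each gene/species pair, in first-seen order.
concat_exons() {
    awk '
    /^>/ {
        key = $1
        sub(/_[0-9]+_[0-9]+$/, "", key)        # strip the two trailing numbers
        if (!(key in seen)) { seen[key] = 1; order[++n] = key }
        size[key] += $2                        # running total of exon sizes
        next
    }
    /^[A-Z-]/ { seq[key] = seq[key] $1 }       # append this exon
    END {
        for (i = 1; i <= n; i++)
            print order[i], size[order[i]] "\n" seq[order[i]]
    }'
}

toy_input | concat_exons
# prints:
# >NM_000001_hg18 5
# MAKDE
# >NM_000001_panTro2 5
# MARDQ
```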


On Thu, Oct 29, 2009 at 11:27 AM, zhuocheng Hou <[email protected]> wrote:
>
> Hi Everyone,
>
> I want to extract the CDS region from the 44way_refseq alignment file.
> However, this alignment is based on exons. Can anyone give me some
> information about how to link these exons into the full CDS?
>
> The sequence names look like this: NM_001077470_hg18_1_7. What is the
> meaning of the _1_7?
>
> Thanks
> Zhuocheng
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome