On Fri, Oct 30, 2009 at 12:52 AM, zhuocheng Hou <[email protected]> wrote:

> Hi Everyone,
>
> I used the awk script provided by Brian (quoted below) to concatenate all
> the exon alignments into one file. I am not familiar with awk, so I simply
> copied the script and ran it on the sequence file as suggested. I found some
> strange things in the results.
>
> (1) I found many stop codons in the CDS sequences, e.g., NM_002099 and
> NM_2193. This seems to be widespread in the exon alignment file.
> I used the refGene.exonnuc.fa file.
> (2) I don't know how the Genome Browser group generated the 44way refSeq
> exon alignment file. I found some duplicates in the sequence file, e.g.,
> NM_001320.
>
> Can anyone explain a little about these two questions?
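
One way to quantify both observations before looking for a cause is
sketched below in Python. This is only a sketch under stated assumptions:
the function names are hypothetical, the stop codons are those of the
standard genetic code (TAA, TAG, TGA), and codons are read in frame from the
first base of the concatenated sequence. A codon that contains an alignment
gap never matches, so it is skipped. The sketch reports where the problems
are; it does not explain them.

```python
from collections import Counter

# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stops(cds):
    """Return 0-based codon indexes of in-frame stops before the last codon."""
    seq = cds.upper()
    n_codons = len(seq) // 3
    # The final codon is excluded because a CDS normally ends with a stop.
    return [i for i in range(n_codons - 1)
            if seq[3 * i:3 * i + 3] in STOP_CODONS]


def repeated_exon_names(header_lines):
    """Return record names (first header field) that occur more than once."""
    counts = Counter(line[1:].split()[0]
                     for line in header_lines if line.startswith(">"))
    return sorted(name for name, n in counts.items() if n > 1)
```

Running repeated_exon_names over the header lines of the exon file, and
internal_stops over each concatenated sequence, gives a list of affected
records that can be reported back to the list.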
>
> Thanks,
> Zhuocheng
>
> On Thu, Oct 29, 2009 at 5:34 PM, Brian Raney <[email protected]> wrote:
>
>> Hey Zhuocheng,
>>
>> There are a couple of ways you can get the full CDS for refSeq genes for
>> all the species with aligning sequence in the 44way.
>>
>> If you have a small set of genes you're interested in, the easiest way
>> would be to use the table browser.  If you want the full set of genes
>> represented in the refSeq set, then you can parse the download file by
>> concatenating the exons.  I'll describe both these methods below.
>>
>> First, the format of the entries in the CDS FASTA data set, and how to get
>> them out of the table browser, is described here:
>> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA
>>
>> If you're not familiar with using the Table Browser, you can read the
>> tutorial here:
>> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
>>
>> Secondly, if you want the whole CDS from the exon-only downloads, you can
>> just concatenate all the exons for a particular gene. I include an awk
>> script below that does this (WARNING: awk script not validated by our
>> QA dept. Use at your own risk).
>>
>> If this doesn't answer your question, feel free to write back to this
>> list.
>>
>> Brian Raney
>>
>> ---
>>
>> To run script:
>>
>> $ zcat refGene.exonAA.fa.gz | awk -f awk.script
>>
>> where awk.script is a file with the following in it:
>>
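>> # Header lines look like ">NM_001077470_hg18_1_7 <size> ...".
>> />/ {
>> # Strip the "_N_M" suffix to leave ">gene_species".
>> geneSpecies=$1;gsub("_[0-9]+_[0-9]+","",geneSpecies);
>> # The species is whatever follows the last underscore.
>> species=geneSpecies; gsub(".+_","", species);
>> speciesList[species]=1;
>> gene=geneSpecies;gsub("_" species,"",gene);
>> # On a new gene for this species, print the finished record and reset.
>> if (geneBuf[species] != gene)
>>    {
>>    if (geneBuf[species] != "")
>>        print geneBuf[species] "_" species, size[species] "\n" sequence[species];
>>    geneBuf[species]=gene; sequence[species]=""; size[species]=$2
>>    }
>> else
>>    {size[species] += $2}
>> }
>>
>> # Sequence lines: append to the buffer of the current species.
>> /^[A-Z-]/ {sequence[species] = sequence[species] $1}
>>
>> # Flush the last record held for each species.
>> END {for(ii in speciesList)
>>        print geneBuf[ii] "_" ii, size[ii] "\n" sequence[ii];
>>    }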
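
For readers who, like the original poster, are not comfortable with awk,
here is a rough Python sketch of the same logic. It is an approximation, not
a validated replacement: the function name is hypothetical, and it assumes
header names shaped like the NM_001077470_hg18_1_7 example in this thread,
with the second whitespace-separated field used as a size, as in the awk
script. Unlike the awk script, it appends every non-header line, not only
lines starting with A-Z or "-". Like the awk script, it merges consecutive
records that share a gene name within a species, so a RefSeq that appears
more than once in a row ends up in a single concatenated record.

```python
def concatenate_exons(lines):
    """Yield (gene, species, total_size, sequence) per gene and species."""
    current = {}    # species -> [gene, summed size, concatenated sequence]
    species = None  # species of the most recent header line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumed name layout: gene_species_N_M (split from the right).
            gene, species, _n, _m = fields[0].rsplit("_", 3)
            size = int(fields[1]) if len(fields) > 1 else 0
            rec = current.get(species)
            if rec is not None and rec[0] == gene:
                rec[1] += size  # same gene continues: add this exon's size
            else:
                if rec is not None:
                    # Gene changed for this species: emit the finished record.
                    yield rec[0], species, rec[1], rec[2]
                current[species] = [gene, size, ""]
        elif species is not None:
            current[species][2] += line
    # Emit the last record still buffered for each species.
    for sp, rec in current.items():
        yield rec[0], sp, rec[1], rec[2]
```

To use it on a download, the lines could come from something like
gzip.open(path, "rt"), where path is wherever the exon file was saved.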
>>
>>
>>
>>
>> On Thu, Oct 29, 2009 at 11:27 AM, zhuocheng Hou <[email protected]> wrote:
>> >
>> > Hi Everyone,
>> >
>> > I want to extract the CDS regions from the 44way_refseq alignment file.
>> > However, this alignment is based on exons. Can anyone give me some
>> > information about how to link these exons into full CDS sequences?
>> >
>> > The sequence names look like this: NM_001077470_hg18_1_7. What is the
>> > meaning of the _1_7?
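
The awk script elsewhere in this thread strips the last two numeric fields
of the name and uses the second header field as a size, which suggests the
"_1_7" part identifies the exon and not the gene. A small Python sketch
for splitting such a name is shown here. Treating the two numbers as the
exon's index and the gene's exon count is an assumption that is not
confirmed in this thread, and the class and function names are hypothetical.

```python
from typing import NamedTuple


class ExonHeader(NamedTuple):
    gene: str        # e.g. "NM_001077470"
    species: str     # assembly name, e.g. "hg18"
    exon_index: int  # assumed: position of this exon within the gene
    exon_count: int  # assumed: total number of exons for the gene
    size: int        # second whitespace-separated header field, 0 if absent


def parse_exon_header(line):
    """Split a header such as '>NM_001077470_hg18_1_7 <size> ...' into parts."""
    fields = line.lstrip(">").split()
    # Split from the right so the underscore inside the accession survives.
    gene, species, exon_index, exon_count = fields[0].rsplit("_", 3)
    size = int(fields[1]) if len(fields) > 1 else 0
    return ExonHeader(gene, species, int(exon_index), int(exon_count), size)
```

Splitting from the right only works if the species name itself contains no
underscore, which holds for names like hg18 but is also an assumption.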
>> >
>> > Thanks
>> > Zhuocheng
>> > _______________________________________________
>> > Genome maillist  -  [email protected]
>> > https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>
>>
>>
>
>
> --
> Zhuocheng Hou, Ph.D.
> PRB/NICHD/NIH
> Wayne State University School of Medicine
> 540 E. Canfield Avenue
> Detroit, MI 48201
>



