Re: [Genome] Wrong exon alignment or wrong scripts?

Jim Kent Fri, 30 Oct 2009 14:56:13 -0700

In the UCSC genes I mostly use rhesus, mouse, and dog to add some weight to
the CDS scores.
In general it sounds like for your purposes you should throw out the 2x
assemblies at the least.
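
A minimal Python sketch of dropping selected assemblies from a concatenated
alignment file. It assumes the file has the shape printed by the awk script
quoted later in this thread: a header line such as ">NM_000001_hg18 6", whose
assembly name follows the last underscore of the first field, then the
sequence. The thread does not say which assemblies are 2x, so the name
"asmX1" and the accession below are made-up placeholders, and the function
names are illustrative only.

```python
def species_of(header):
    """Return the assembly name: the text after the last underscore
    in the first whitespace-separated field of a header line."""
    name = header.lstrip(">").split()[0]
    return name.rsplit("_", 1)[-1]


def drop_assemblies(lines, excluded):
    """Yield FASTA lines, skipping every record whose assembly is in `excluded`."""
    keep = True
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A header decides whether this record is kept or skipped.
            keep = species_of(line) not in excluded
        if keep:
            yield line


if __name__ == "__main__":
    # Placeholder records; "asmX1" stands in for a low-coverage assembly.
    records = [
        ">NM_000001_hg18 6", "ATGAAA",
        ">NM_000001_asmX1 6", "ATG---",
    ]
    for out in drop_assemblies(records, {"asmX1"}):
        print(out)
```

The exclusion set would need to be filled in by hand from the assembly
descriptions for the 44way.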


On Oct 30, 2009, at 2:42 PM, Brian Raney wrote:

> Hey Zhuocheng,
>
> No gene sets other than the human are used to generate these alignments.
> These are entirely nucleotide alignments with no gene models used in the
> alignment process.  You can look here for how the alignments are generated:
> http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto
>
> As you have noticed, these alignments are not always trustworthy and should
> be carefully examined before trusting that the correct sequence is being
> aligned since it's very difficult for a gene agnostic aligner to know if the
> aligning sequence is correct, or merely the most similar available in the
> aligning genome.  There have been some attempts to judge the likelihood of a
> particular alignment being correct given the tree topology, but the current
> state-of-the-art still requires that the alignment user spend the time to
> validate an alignment given what they know about what seems likely.
>
> Brian
>
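
One cheap first step in the kind of validation Brian describes is to scan a
concatenated nucleotide CDS for stop codons that occur before the final
codon. This is a Python sketch only. It assumes the sequence starts in frame
0, that alignment columns follow the human reading frame, and that gaps are
written as "-". The name in_frame_stops is illustrative. A hit does not by
itself show that the wrong sequence was aligned; it only marks a record worth
a closer look.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_stops(seq):
    """Return 0-based codon indices of stop codons found before the last codon.

    Codons containing gap characters or other non-stop triplets never match
    STOP_CODONS, so they are skipped without special handling.
    """
    seq = seq.upper()
    n_codons = len(seq) // 3
    hits = []
    # The final codon is expected to be the real stop, so it is not checked.
    for i in range(n_codons - 1):
        codon = seq[3 * i:3 * i + 3]
        if codon in STOP_CODONS:
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Toy sequences, not taken from any real alignment.
    print(in_frame_stops("ATGAAATGA"))
    print(in_frame_stops("ATG---TAGCCCTAA"))
```

Counting hits per assembly would show whether the problem is concentrated in
a few genomes or spread across all of them.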
> On Fri, Oct 30, 2009 at 11:32 AM, zhuocheng Hou <[email protected]> wrote:
>
>> Thanks, Jim and Brian. We can find numerous reasons to explain stop codons
>> in exons. The problem is that the pipeline/algorithm for predicting those
>> exons is not suitable for low-coverage genomes in many situations, so
>> analysis based on these alignments is questionable. Which gene sets are
>> used in the genome browser for other genomes, e.g., chimp or cow? Is there
>> a way to get better alignments with no, or only a very small fraction of,
>> stop codons in the CDS?
>>
>>
>>
>>
>> On Fri, Oct 30, 2009 at 1:33 PM, Brian Raney <[email protected]> wrote:
>>
>>> Hey Zhuocheng,
>>>
>>> I think Jim did a good job answering your first question (thanks Jim!).
>>> The answer to your second question is that the refSeq gene you mention is
>>> on chr6, for which we support a couple of haplotype chromosomes, so we
>>> annotate that gene as being on chr6 as well as on chr6_cox_hap1 and
>>> chr6_qbl_hap2.  My awk script doesn't carry along the chrom addresses, so
>>> this data doesn't show up in its output, but it is in the original files.
>>>
>>> Brian Raney
>>>
>>>
>>> On Fri, Oct 30, 2009 at 6:53 AM, zhuocheng Hou <[email protected]> wrote:
>>>
>>>> On Fri, Oct 30, 2009 at 12:52 AM, zhuocheng Hou <[email protected]> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I used the awk script provided by Brian (quoted below) to concatenate all
>>>>> the exon alignments into one file. I am not familiar with awk, so I only
>>>>> copied the script and ran it on the sequence file directly, as suggested.
>>>>> I found some strange things in the results.
>>>>>
>>>>> (1) I found lots of stop codons in the CDS sequences, e.g., NM_002099 and
>>>>> NM_2193; this is widespread throughout the exon alignment file. I used
>>>>> the refGene.exonnuc.fa file.
>>>>> (2) I don't know how the genome browser group generates the 44way refseq
>>>>> exon alignment file. I found some duplicates in the sequence file, e.g.,
>>>>> NM_001320.
>>>>>
>>>>> Can anyone explain a little about these two questions?
>>>>>
>>>>> Thanks,
>>>>> Zhuocheng
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Oct 29, 2009 at 5:34 PM, Brian Raney <[email protected]> wrote:
>>>>>
>>>>>> Hey Zhuocheng,
>>>>>>
>>>>>> There are a couple of ways you can get the full CDS for refSeq genes for
>>>>>> all the species with aligning sequence in the 44way.
>>>>>>
>>>>>> If you have a small set of genes you're interested in, the easiest way
>>>>>> would be to use the table browser.  If you want the full set of genes
>>>>>> represented in the refSeq set, then you can parse the download file by
>>>>>> concatenating the exons.  I'll describe both these methods below.
>>>>>>
>>>>>> First, the format of the entries in the CDS FASTA data set, and how to
>>>>>> get them out of the table browser, is described here:
>>>>>> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA
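
For readers who would rather not work in awk, here is a small Python sketch
that splits a sequence name such as NM_001077470_hg18_1_7 into its parts. The
pattern mirrors what the awk script in this message strips off: an assembly
name with no underscore in it, followed by two numeric fields. Reading those
two numbers as an exon number and an exon count is an assumption made for
illustration; the help page linked above is the authority on their meaning.
The names NAME_RE and split_name are hypothetical.

```python
import re

# gene accession, assembly (no underscores), then two numeric fields
NAME_RE = re.compile(
    r"^>?(?P<gene>.+)_(?P<assembly>[^_]+)_(?P<first>\d+)_(?P<second>\d+)$"
)


def split_name(name):
    """Split 'GENE_ASSEMBLY_N_M' into (gene, assembly, N, M)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError("unrecognised sequence name: %r" % name)
    return (
        match.group("gene"),
        match.group("assembly"),
        int(match.group("first")),
        int(match.group("second")),
    )


if __name__ == "__main__":
    # The example name comes from the original question in this thread.
    print(split_name("NM_001077470_hg18_1_7"))
```

Only the first whitespace-separated field of a header line should be passed
in; the remaining fields are not handled here.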
>>>>>>
>>>>>> If you're not familiar with using the Table Browser, you can read the
>>>>>> tutorial here:
>>>>>> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
>>>>>>
>>>>>> Secondly, if you want the whole CDS from the exon only downloads you can
>>>>>> just concatenate all the exons for a particular gene together.  I include
>>>>>> an awk script below which does this (WARNING: awk script not validated by
>>>>>> our QA dept. Use at your own risk).
>>>>>>
>>>>>> If this doesn't answer your question, feel free to write back to this
>>>>>> list.
>>>>>>
>>>>>> Brian Raney
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> To run script:
>>>>>>
>>>>>> $ zcat refGene.exonAA.fa.gz | awk -f awk.script
>>>>>>
>>>>>> where awk.script is a file with the following in it:
>>>>>>
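>>>>>> # Header line: field 1 is the sequence name, field 2 is added to size.
>>>>>> />/ {
>>>>>> # drop the trailing _<number>_<number> suffix from the name
>>>>>> geneSpecies=$1;gsub("_[0-9]+_[0-9]+","",geneSpecies);
>>>>>> # species is whatever follows the last underscore
>>>>>> species=geneSpecies; gsub(".+_","", species);
>>>>>> speciesList[species]=1;
>>>>>> gene=geneSpecies;gsub("_" species,"",gene);
>>>>>> # a new gene starts: flush the previous gene for this species
>>>>>> if (geneBuf[species] != gene)
>>>>>>   {
>>>>>>   if (geneBuf[species] != "")
>>>>>>       print geneBuf[species] "_" species, size[species] "\n" sequence[species];
>>>>>>   geneBuf[species]=gene; sequence[species]=""; size[species]=$2
>>>>>>   }
>>>>>> else
>>>>>>   {size[species] += $2}
>>>>>> }
>>>>>>
>>>>>> # Sequence line: append it to the current species' buffer.
>>>>>> /^[A-Z-]/ {sequence[species] = sequence[species] $1}
>>>>>>
>>>>>> # Flush the last gene seen for every species.
>>>>>> END {for(ii in speciesList)
>>>>>>       print geneBuf[ii] "_" ii, size[ii] "\n" sequence[ii];
>>>>>>   }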
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 29, 2009 at 11:27 AM, zhuocheng Hou <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> I want to extract the CDS regions from the 44way_refseq alignment file.
>>>>>>> However, this alignment is based on exons. Can anyone give me some
>>>>>>> information about this file and how to link these exons into a full CDS?
>>>>>>>
>>>>>>> The sequence names look like this: NM_001077470_hg18_1_7. What is the
>>>>>>> meaning of the _1_7?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Zhuocheng
>>>>>>> _______________________________________________
>>>>>>> Genome maillist  -  [email protected]
>>>>>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Zhuocheng Hou, Ph.D.
>>>>> PRB/NICHD/NIH
>>>>> Wayne State University School of Medicine
>>>>> 540 E. Canfield Avenue
>>>>> Detroit, MI 48201
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>

