Re: [Genome] Wrong exon alignment or wrong scripts?

Brian Raney Fri, 30 Oct 2009 11:43:00 -0700

Hey Zhuocheng,

No gene sets other than the human are used to generate these alignments.
 These are entirely nucleotide alignments with no gene models used in the
alignment process.  You can look here for how the alignments are generated:
http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto


As you have noticed, these alignments are not always trustworthy, and you
should examine them carefully before trusting that the correct sequence is
aligned. It is very difficult for a gene-agnostic aligner to know whether the
aligning sequence is the correct one or merely the most similar one available
in the aligning genome.  There have been some attempts to judge the likelihood
that a particular alignment is correct given the tree topology, but the
current state of the art still requires the alignment user to spend the time
validating an alignment against what they know to be likely.
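
One quick way to start that kind of check, as a minimal sketch rather than a
vetted tool: the shell/awk snippet below lists records that contain an
in-frame stop codon before the final codon. It assumes the input is
concatenated nucleotide CDS FASTA (a ">name size" header followed by sequence
lines, as you would get from the exon nucleotide download after joining
exons), and that the first column of each record is codon position 1. The
function name find_stops is hypothetical. Codons that contain gap characters
are never reported, and trailing bases that do not fill a codon are ignored.

```shell
# Hypothetical helper: report in-frame stop codons in concatenated CDS FASTA.
# Assumption: each record is a ">name size" header followed by sequence
# lines, and the sequence starts at codon position 1.
find_stops() {
  awk '
    # Scan the buffered record and print "name<TAB>codon numbers" if any
    # stop codon occurs before the last codon.
    function report(   n, i, codon, stops) {
      if (name == "") return
      n = int(length(seq) / 3)
      stops = ""
      # Skip the last codon, where a stop is expected.
      for (i = 0; i < n - 1; i++) {
        codon = toupper(substr(seq, 3 * i + 1, 3))
        if (codon == "TAA" || codon == "TAG" || codon == "TGA")
          stops = stops " " (i + 1)
      }
      if (stops != "") print name "\t" substr(stops, 2)
    }
    /^>/ { report(); name = substr($1, 2); seq = ""; next }
    { seq = seq $1 }
    END { report() }
  '
}

# Toy input: the second record has TAG as its second codon.
printf '>geneA_hg18 9\nATGAAATAA\n>geneA_panTro2 9\nATGTAGTAA\n' | find_stops
```

On the toy input this prints a single line, geneA_panTro2 followed by a tab
and the codon number 2. A hit only marks a record for closer inspection; it
does not tell you whether the alignment or the underlying assembly is at
fault.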

Brian

On Fri, Oct 30, 2009 at 11:32 AM, zhuocheng Hou <[email protected]> wrote:

> Thanks Jim and Brian. We can find numerous reasons to explain the stop
> codons in exons. The problem is that the pipeline/algorithm for predicting
> those exons is not suitable for low-coverage genomes in many situations.
> Analysis based on these alignments is questionable. What gene sets are used
> in the genome browser for other genomes, e.g., chimp, cow? Can we find some
> way to get better alignments with no, or very few, stop codons in the CDS?
>
>
>
>
> On Fri, Oct 30, 2009 at 1:33 PM, Brian Raney <[email protected]> wrote:
>
>> Hey Zhuocheng,
>>
>> I think Jim did a good job answering your first question (thanks Jim!).
>>  The answer to your second question is that the refSeq gene you mention is
>> on chr6, for which we support a couple of haplotype chromosomes, so we
>> annotate that gene as being on chr6 as well as chr6_cox_hap1 and
>> chr6_qbl_hap2.   My awk script doesn't carry along the chrom addresses, so
>> this data doesn't show up in its output, but it is in the original files.
>>
>> Brian Raney
>>
>>
>> On Fri, Oct 30, 2009 at 6:53 AM, zhuocheng Hou <[email protected]> wrote:
>>
>>> On Fri, Oct 30, 2009 at 12:52 AM, zhuocheng Hou <[email protected]> wrote:
>>>
>>> > Hi Everyone,
>>> >
>>> > I used the awk script provided by Brian (below) to concatenate all the
>>> > exon alignments into one file. I am not familiar with awk, so I just
>>> > copied the script and ran it on the sequence file directly, as suggested.
>>> > I found some strange things in the results.
>>> >
>>> > (1) I found lots of stop codons in the CDS sequences, e.g., NM_002099,
>>> > NM_2193. This is widespread in the exon alignment file. I used the
>>> > refGene.exonnuc.fa file.
>>> > (2) I don't know how the genome browser group generates the 44way refseq
>>> > exon alignment file. I found some duplicates in the sequence file, e.g.,
>>> > NM_001320.
>>> >
>>> > Can anyone explain a little about these two questions?
>>> >
>>> > Thanks,
>>> > Zhuocheng
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Oct 29, 2009 at 5:34 PM, Brian Raney <[email protected]> wrote:
>>> >
>>> >> Hey Zhuocheng,
>>> >>
>>> >> There are a couple of ways you can get the full CDS for refSeq genes for
>>> >> all the species with aligning sequence in the 44way.
>>> >>
>>> >> If you have a small set of genes you're interested in, the easiest way
>>> >> would be to use the table browser.  If you want the full set of genes
>>> >> represented in the refSeq set, then you can parse the download file by
>>> >> concatenating the exons.  I'll describe both these methods below.
>>> >>
>>> >> First, the format of the entries in the CDS FASTA data set, and how to
>>> >> get them out of the table browser, is described here:
>>> >> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA
>>> >>
>>> >> If you're not familiar with using the Table Browser, you can read the
>>> >> tutorial here:
>>> >> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
>>> >>
>>> >> Second, if you want the whole CDS from the exon-only downloads, you can
>>> >> just concatenate all the exons for a particular gene together.  I include
>>> >> an awk script below which does this (WARNING: awk script not validated by
>>> >> our QA dept. Use at your own risk).
>>> >>
>>> >> If this doesn't answer your question, feel free to write back to this
>>> >> list.
>>> >>
>>> >> Brian Raney
>>> >>
>>> >> ---
>>> >>
>>> >> To run script:
>>> >>
>>> >> $ zcat refGene.exonAA.fa.gz | awk -f awk.script
>>> >>
>>> >> where awk.script is a file with the following in it:
>>> >>
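>>> >> # Header lines, e.g. ">NM_001077470_hg18_1_7", with the size in field 2.
>>> >> /^>/ {
>>> >> # Strip the trailing _N_M numeric suffix, leaving ">gene_species".
>>> >> geneSpecies=$1; gsub("_[0-9]+_[0-9]+","",geneSpecies);
>>> >> # The species is whatever follows the last underscore.
>>> >> species=geneSpecies; gsub(".+_","", species);
>>> >> speciesList[species]=1;
>>> >> gene=geneSpecies; gsub("_" species,"",gene);
>>> >> if (geneBuf[species] != gene)
>>> >>    {
>>> >>    # New gene for this species: print the previous one, then reset.
>>> >>    if (geneBuf[species] != "")
>>> >>        print geneBuf[species] "_" species, size[species] "\n" sequence[species];
>>> >>    geneBuf[species]=gene; sequence[species]=""; size[species]=$2
>>> >>    }
>>> >> else
>>> >>    {size[species] += $2}
>>> >> }
>>> >>
>>> >> # Sequence lines: append to the buffer of the current species.
>>> >> /^[A-Z-]/ {sequence[species] = sequence[species] $1}
>>> >>
>>> >> # Print the last gene seen for each species.
>>> >> END {for(ii in speciesList)
>>> >>        print geneBuf[ii] "_" ii, size[ii] "\n" sequence[ii];
>>> >>    }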
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Oct 29, 2009 at 11:27 AM, zhuocheng Hou <[email protected]> wrote:
>>> >> >
>>> >> > Hi Everyone,
>>> >> >
>>> >> > I want to extract the CDS regions from the 44way_refseq alignment file.
>>> >> > However, this alignment is exon-based. Can anyone give me some
>>> >> > information about how to link these exons into a full CDS?
>>> >> >
>>> >> > The sequence names look like this: NM_001077470_hg18_1_7. What is the
>>> >> > meaning of the _1_7?
>>> >> >
>>> >> > Thanks
>>> >> > Zhuocheng
>>> >> > _______________________________________________
>>> >> > Genome maillist  -  [email protected]
>>> >> > https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>> > --
>>> > Zhuocheng Hou, Ph.D.
>>> > PRB/NICHD/NIH
>>> > Wayne State University School of Medicine
>>> > 540 E. Canfield Avenue
>>> > Detroit, MI 48201
>>> >
>>>
>>>
>>>
>>> --
>>> Zhuocheng Hou, Ph.D.
>>> PRB/NICHD/NIH
>>> Wayne State University School of Medicine
>>> 540 E. Canfield Avenue
>>> Detroit, MI 48201
>>> _______________________________________________
>>> Genome maillist  -  [email protected]
>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>>
>>
>>
>
>
> --
> Zhuocheng Hou, Ph.D.
> PRB/NICHD/NIH
> Wayne State University School of Medicine
> 540 E. Canfield Avenue
> Detroit, MI 48201
>
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
