I see. Thank you. Sean
-----Original Message-----
From: Vanessa Kirkup Swing [mailto:[email protected]]
Sent: Thursday, July 07, 2011 5:06 PM
To: Xiang Li
Cc: [email protected]
Subject: Re: [Genome] [help] Lots of stop codons in multiz46way protein alignment file

Hi Sean,

Here is what one of our engineers had to say:

"The 46-way multiple alignment is built without regard to the gene models. These CDS FASTA alignments are created with the assumption that the exon is the same size in all the other species as it is in human. This is not true for all exons, so if the tenrec does have an earlier stop codon, the CDS FASTA alignments will still show the exon as the same size as the human one. Also, the Tenrec assembly is a 2X assembly, so there are likely to be many places where the assembly is not an accurate representation of the tenrec genome. It's up to the researcher to filter out regions that don't conform to their own expectations."

Hope that this helps to clarify things for you.

Vanessa Kirkup Swing
UCSC Genome Bioinformatics Group

On Thu, Jul 7, 2011 at 2:20 PM, Xiang Li <[email protected]> wrote:
> Hey, Mary,
>
> Thank you tremendously!
>
> Regarding the 2nd email, actually I was asking why there are letters
> after a "Z". There are 217273 such cases, such as
>
> $ grep -B1 Z[A-Z] refGene.exonAA.fa
> >NM_000152_echTel1_14_19 50 0 2 scaffold_298195:1056-1143+
> PQEPYRFGEQAQSAMRKAL-LRYALLPZL---------------------
>
> >NM_001080397_dipOrd1_2_8 31 1 1 scaffold_7684:5-74+
> --------GAZSDRC-SRFG--RPFI-VLAI
>
> I would think no amino acids after "Z". Right?
>
> Best
>
> Sean
>
> From: Mary Goldman [mailto:[email protected]]
> Sent: Thursday, July 07, 2011 11:50 AM
> To: Xiang Li
> Cc: [email protected]
> Subject: Re: [Genome] [help] Lots of stop codons in multiz46way protein
> alignment file
>
> Hi Sean,
>
> Yes, you can get the entire genome alignment here:
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/.
> Beware, the compressed data size of these files is 31 Gb and uncompressed is
> more than 250 Gb. For a description of multiple alignment format (MAF),
> see http://genome.ucsc.edu/goldenPath/help/maf.html.
>
> Also, in response to your other email, stop codons are represented with
> a Z. Information about this file format, including non-protein
> characters, can be found here:
> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA.
>
> I hope this information is helpful. Please contact us again at
> [email protected] if you have any further questions.
>
> Best,
> Mary
> ------------------
> Mary Goldman
> UCSC Bioinformatics Group
>
> On 7/6/11 4:38 PM, Xiang Li wrote:
>
> Hi, Mary,
>
> I got it. Thanks a lot!
>
> BTW, from the webpage you pointed me to, it seems there are multiple
> alignments at the DNA level between entire genomes, i.e., not only just
> for CDS regions, but also for entire exon and intronic regions.
>
> Is my understanding correct? If so, could you please instruct me how to
> get those MAFs?
>
> Thanks!
>
> Sean
>
> From: Mary Goldman [mailto:[email protected]]
> Sent: Wednesday, July 06, 2011 4:30 PM
> To: Xiang Li
> Cc: [email protected]
> Subject: Re: [Genome] [help] Lots of stop codons in multiz46way protein
> alignment file
>
> Hi Sean,
>
> You can also view the Gorilla browser at our preview site here:
> http://genome-preview.cse.ucsc.edu/cgi-bin/hgTracks?db=gorGor1. It tends
> to be more reliably available than our test site. Our preview site
> carries the same warning that tracks and data on the test server have
> not undergone formal quality assurance.
>
> Best,
> Mary
> ---------------------
> Mary Goldman
> UCSC Bioinformatics Group
>
> On 7/6/11 4:16 PM, Mary Goldman wrote:
>
> Hi Sean,
>
> Codons with an N in any position are represented with an X (stop codons
> are represented with a Z).
> Assemblies that are not well sequenced, such
> as the Gorilla (gorGor1) will have quite a few Ns (which are bases with
> low quality scores) and, thus, quite a few Xs in the protein alignment
> file. You can confirm this by viewing the gorGor1 assembly on our test
> browser here:
> http://genome-test.cse.ucsc.edu/cgi-bin/hgTracks?db=gorGor1. Please note
> that tracks and data on the test server have not undergone formal
> quality assurance.
>
> More information about this file format can be found here:
> http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA.
>
> I hope this information is helpful. Please contact us again at
> [email protected] if you have any further questions.
>
> Best,
> Mary
> ------------------
> Mary Goldman
> UCSC Bioinformatics Group
>
> On 7/6/11 3:21 PM, Xiang Li wrote:
>
> Hi, Dear Support,
>
> It would be easy to understand if they are at the end of a protein
> sequence. However, could you please help me understand why there are so
> many "X"es inside some sequences?
>
> http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/alignments/refGene.exonAA.fa.gz
>
> >NM_000152_gorGor1_18_19 51 0 0 Supercontig_0039638:17387-17539+
> NXIXNELVXVTSEGAGLQLQKVTVLGVATAPQQVXSNGVPVSNFTYSPDTK
> --
> >NM_001079803_gorGor1_18_19 51 0 0 Supercontig_0039638:17387-17539+
> NXIXNELVXVTSEGAGLQLQKVTVLGVATAPQQVXSNGVPVSNFTYSPDTK
> --
> >NM_001079804_gorGor1_18_19 51 0 0 Supercontig_0039638:17387-17539+
> NXIXNELVXVTSEGAGLQLQKVTVLGVATAPQQVXSNGVPVSNFTYSPDTK
>
> There are more than 30,000 sequences with X like that. Please help.
> Thanks!
>
> Sean
>
> Sean (Xiang) Li, Ph.D
> Bioinformatics Scientist
> Ambry Genetics
> [email protected] <mailto:[email protected]>
> Direct 949-900-5504
> Fax 949-900-5501
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
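
The engineer's suggestion to filter out regions that don't conform to expectations could be approached along the following lines. This is only a minimal Python sketch, assuming the exonAA file is plain FASTA in which Z marks a stop codon, X marks a codon containing an N, and - marks an alignment gap, as described in the replies above. The function names (read_fasta, has_internal_stop, filter_records), the max_x threshold, and the last sample record are made up for illustration and are not part of any UCSC tool.

```python
import re
from typing import Iterable, Iterator, Tuple

# A stop codon (Z) followed, possibly after gap characters, by another letter.
# Note that [A-Z] also matches Z and X, so "ZZ" or "ZX" counts as internal too.
INTERNAL_STOP = re.compile(r"Z-*[A-Z]")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def has_internal_stop(seq: str) -> bool:
    """True if any residue letter appears after a Z in the sequence."""
    return INTERNAL_STOP.search(seq) is not None


def filter_records(lines: Iterable[str], max_x: int = 0) -> Iterator[Tuple[str, str]]:
    """Keep records with no internal stop and at most max_x ambiguous (X) codons."""
    for header, seq in read_fasta(lines):
        if has_internal_stop(seq) or seq.count("X") > max_x:
            continue
        yield header, seq


# First three records are copied from this thread; the last one is hypothetical.
SAMPLE = """\
>NM_000152_echTel1_14_19 50 0 2 scaffold_298195:1056-1143+
PQEPYRFGEQAQSAMRKAL-LRYALLPZL---------------------
>NM_001080397_dipOrd1_2_8 31 1 1 scaffold_7684:5-74+
--------GAZSDRC-SRFG--RPFI-VLAI
>NM_000152_gorGor1_18_19 51 0 0 Supercontig_0039638:17387-17539+
NXIXNELVXVTSEGAGLQLQKVTVLGVATAPQQVXSNGVPVSNFTYSPDTK
>hypothetical_clean_record 5 0 0 chrN:1-15+
MKTAZ
""".splitlines()

if __name__ == "__main__":
    for header, seq in filter_records(SAMPLE):
        print(header.split()[0], seq)
```

For the real file, the lines could be read with gzip.open("refGene.exonAA.fa.gz", "rt") instead of the inline sample. Whether a record with a few Xs should be dropped or kept is a judgment call for the researcher, which is why max_x is left as a parameter.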
