Hey, Mary,

 

Thank you tremendously!  

 

Regarding the 2nd email, I was actually asking why there are letters
after a "Z".  There are 217273 such cases, for example:

 

$ grep -B1 'Z[A-Z]' refGene.exonAA.fa

>NM_000152_echTel1_14_19 50 0 2 scaffold_298195:1056-1143+

PQEPYRFGEQAQSAMRKAL-LRYALLPZL---------------------

 

>NM_001080397_dipOrd1_2_8 31 1 1 scaffold_7684:5-74+

--------GAZSDRC-SRFG--RPFI-VLAI

 

I would expect no amino acids after a "Z". Is that right?
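
The grep check above can also be sketched in Python, which counts records instead of matching lines. This is only an illustration: the names `read_fasta` and `records_with_internal_stop` are made up here, and the sketch assumes each record is a ">" header followed by one or more sequence lines, as in the excerpt above.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# A "Z" (stop codon) immediately followed by another residue letter.
INTERNAL_STOP = re.compile(r"Z[A-Z]")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; blank lines are skipped."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def records_with_internal_stop(lines: Iterable[str]) -> List[str]:
    """Return headers of records where a letter directly follows a Z."""
    return [h for h, seq in read_fasta(lines) if INTERNAL_STOP.search(seq)]
```

Passing an open file handle for refGene.exonAA.fa to `records_with_internal_stop` and taking the length of the result would give a per-record count. That number need not equal the grep line count, since a record whose sequence wraps over several lines is counted once.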

 

Best

 

Sean

 

 

From: Mary Goldman [mailto:[email protected]] 
Sent: Thursday, July 07, 2011 11:50 AM
To: Xiang Li
Cc: [email protected]
Subject: Re: [Genome] [help] Lots of stop codons in multiz46way protein
alignment file

 

Hi Sean,

Yes, you can get the entire genome alignment here:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/. Note that
the compressed size of these files is 31 GB and the uncompressed size is
more than 250 GB. For a description of the multiple alignment format (MAF),
see http://genome.ucsc.edu/goldenPath/help/maf.html.
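
Given the size of these files, it may help to read them one alignment block at a time rather than loading them whole. The following is a minimal sketch of that idea, assuming the MAF layout in which a block starts with an "a" line, contains "s" lines of the form `s src start size strand srcSize text`, and ends at a blank line. The function name `read_maf_blocks` and the dictionary keys are hypothetical, and other line types are simply ignored.

```python
from typing import Dict, Iterable, Iterator, List


def read_maf_blocks(lines: Iterable[str]) -> Iterator[List[Dict[str, object]]]:
    """Yield one list of sequence rows per alignment block."""
    block: List[Dict[str, object]] = []
    in_block = False
    for raw in lines:
        fields = raw.split()
        if not fields or fields[0].startswith("#"):
            # A blank line ends the current block; "#" lines are ignored.
            if not fields and in_block:
                yield block
                block, in_block = [], False
            continue
        if fields[0] == "a":
            # An "a" line starts a new block (flush the old one if needed).
            if in_block:
                yield block
            block, in_block = [], True
        elif fields[0] == "s" and in_block and len(fields) >= 7:
            block.append({
                "src": fields[1],
                "start": int(fields[2]),
                "size": int(fields[3]),
                "strand": fields[4],
                "src_size": int(fields[5]),
                "text": fields[6],
            })
    if in_block:
        yield block
```

Because the function is a generator over lines, it can be fed a file handle directly. If the downloaded files are gzip-compressed, opening them with `gzip.open(path, "rt")` would avoid writing the uncompressed data to disk.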

Also, in response to your other email, stop codons are represented with
a Z. Information about this file format, including non-protein
characters, can be found here:
http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA.

I hope this information is helpful. Please contact us again at
[email protected] if you have any further questions.

Best,
Mary
------------------
Mary Goldman
UCSC Bioinformatics Group

On 7/6/11 4:38 PM, Xiang Li wrote: 

Hi, Mary,

 

I got it. Thanks a lot!    

 

BTW, from the webpage you pointed me to, it seems there are multiple
alignments at the DNA level between entire genomes, i.e., not just
for CDS regions but also for entire exonic and intronic regions.

 

Is my understanding correct? If so, could you please tell me how to
get those MAFs?

 

Thanks!

 

 

Sean

 

From: Mary Goldman [mailto:[email protected]] 
Sent: Wednesday, July 06, 2011 4:30 PM
To: Xiang Li
Cc: [email protected]
Subject: Re: [Genome] [help] Lots of stop codons in multiz46way protein
alignment file

 

Hi Sean,

You can also view the Gorilla browser at our preview site here:
http://genome-preview.cse.ucsc.edu/cgi-bin/hgTracks?db=gorGor1. It tends
to be more reliably available than our test site. Our preview site
carries the same warning that tracks and data on the test server have
not undergone formal quality assurance. 

Best,
Mary
---------------------
Mary Goldman
UCSC Bioinformatics Group

On 7/6/11 4:16 PM, Mary Goldman wrote: 

Hi Sean,

Codons with an N in any position are represented with an X (stop codons
are represented with a Z). Assemblies that are not well sequenced, such
as the Gorilla (gorGor1), will have quite a few Ns (which are bases with
low quality scores) and, thus, quite a few Xs in the protein alignment
file. You can confirm this by viewing the gorGor1 assembly on our test
browser here:
http://genome-test.cse.ucsc.edu/cgi-bin/hgTracks?db=gorGor1. Please note
that tracks and data on the test server have not undergone formal
quality assurance. 

More information about this file format can be found here:
http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA. 
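
The X and Z convention described above can be illustrated with a small translation sketch. This is not UCSC's code: the function name `translate_codons` is hypothetical, the standard genetic code is assumed, and passing an all-gap codon through as "-" is a choice made only for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_codons(dna: str) -> str:
    """Translate in frame 0: N-containing codons -> X, stop codons -> Z."""
    dna = dna.upper()
    out = []
    # Trailing bases that do not fill a whole codon are dropped.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        if codon == "---":
            out.append("-")  # alignment gap passed through (assumption)
        elif "N" in codon:
            out.append("X")  # any N in the codon makes it unknown
        else:
            # Codons outside the table also become X (assumption).
            aa = CODON_TABLE.get(codon, "X")
            out.append("Z" if aa == "*" else aa)
    return "".join(out)
```

For example, `translate_codons("ATGNNNTAAGGC")` returns `"MXZG"`: the codon NNN becomes X and the stop codon TAA becomes Z. In this sketch translation simply continues past a stop codon; whether the alignment files are generated the same way is not stated in this thread.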

I hope this information is helpful. Please contact us again at
[email protected] if you have any further questions.

Best,
Mary
------------------
Mary Goldman
UCSC Bioinformatics Group



On 7/6/11 3:21 PM, Xiang Li wrote: 

Hi, Dear Support,
 
 
 
It would be easy to understand if the "X"es were at the end of a protein
sequence. However, could you please help me understand why there are so
many "X"es inside some sequences?
 
 
 
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/alignments/refGene.exonAA.fa.gz
 
 
 

        NM_000152_gorGor1_18_19 51 0 0 Supercontig_0039638:17387-17539+

NXIXNELVXVTSEGAGLQLQKVTVLGVATAPQQVXSNGVPVSNFTYSPDTK
 
--
 

        NM_001079803_gorGor1_18_19 51 0 0 Supercontig_0039638:17387-17539+

NXIXNELVXVTSEGAGLQLQKVTVLGVATAPQQVXSNGVPVSNFTYSPDTK
 
--
 

        NM_001079804_gorGor1_18_19 51 0 0 Supercontig_0039638:17387-17539+

NXIXNELVXVTSEGAGLQLQKVTVLGVATAPQQVXSNGVPVSNFTYSPDTK
 
 
 
 
 
There are more than 30,000 sequences with X like that.   Please help.
Thanks!
 
 
 
Sean
 
 
 
Sean (Xiang) Li, Ph.D
 
Bioinformatics Scientist
 
Ambry Genetics
 
[email protected]
 
Direct 949-900-5504
 
Fax 949-900-5501
 
 
 
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
