Hi Nimrod,

The psl format contains all of the information about an alignment
between two sequences.  Even a small gap on either side (the target or
the query sequence) will start a new block in a psl, so a single exon
can span several blocks, and blocks do not necessarily correspond to
exons.
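
To make the block structure concrete, here is a minimal Python sketch (not
taken from the Genome Browser source) of how the blockSizes, qStarts and
tStarts fields could be used to go from a genomic position to an mRNA
position.  The function name and the example record are hypothetical, and
the sketch assumes a plain 21-column nucleotide psl line with 0-based
coordinates; if your refSeqAli.txt dump has an extra leading "bin" column,
drop it first.

```python
def genomic_to_mrna(psl_fields, genomic_pos):
    """Map a 0-based genomic position to a 0-based mRNA position.

    psl_fields is one psl line split on tabs (21 columns, no leading bin).
    Returns None when the position falls in a gap between blocks or
    outside the alignment.
    """
    strand = psl_fields[8]
    q_size = int(psl_fields[10])
    # The list fields are comma-separated with a trailing comma.
    block_sizes = [int(x) for x in psl_fields[18].rstrip(",").split(",")]
    q_starts = [int(x) for x in psl_fields[19].rstrip(",").split(",")]
    t_starts = [int(x) for x in psl_fields[20].rstrip(",").split(",")]

    for size, q_start, t_start in zip(block_sizes, q_starts, t_starts):
        if t_start <= genomic_pos < t_start + size:
            q_pos = q_start + (genomic_pos - t_start)
            if strand == "-":
                # On the minus strand, qStarts are given on the
                # reverse-complemented query, so flip back to forward
                # mRNA coordinates.  The mRNA base is then the complement
                # of the plus-strand genomic base.
                q_pos = q_size - 1 - q_pos
            return q_pos
    return None


# Hypothetical record: a 100-base mRNA aligned in two 50-base blocks that
# are separated by a 2-base gap on the target (genome) side only.
EXAMPLE_PSL = (
    "100\t0\t0\t0\t0\t0\t1\t2\t+\tNM_EXAMPLE\t100\t0\t100\t"
    "chr1\t1000000\t1000\t1102\t2\t50,50,\t0,50,\t1000,1052,"
).split("\t")

if __name__ == "__main__":
    for pos in (1000, 1049, 1050, 1052, 1101):
        print(pos, genomic_to_mrna(EXAMPLE_PSL, pos))
```

In this made-up record the two blocks are only 2 bases apart on the genome,
which is far too short to be an intron, so both blocks would belong to the
same exon.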

I'm not sure if you saw this mailing list response previously, but it 
looks like it might be particularly useful:
https://lists.soe.ucsc.edu/pipermail/genome/2009-July/019496.html

Also, we recently came across the "SeattleSeq Annotation" site, which 
might already have the exact tools you are looking for:

http://gvs.gs.washington.edu/SeattleSeqAnnotation/index.jsp

I hope this helps!

--
Brooke Rhead
UCSC Genome Bioinformatics Group


On 06/17/10 06:26, nimrod rubinstein wrote:
> hi brooke,
> 
> thanks for the clarification.
> 
> just one question to make sure i fully understand the structure of the
> refSeqAli.txt file: there are fields for describing the alignment blocks. is
> it always the rule that each block is an exon? or do blocks simply denote
> regions that align to the genome above the preselected threshold - and in
> that case a certain exon may actually span several alignment blocks?
> 
> thanks,
> nimrod
> 
> 
> 
> On Tue, Jun 15, 2010 at 3:35 AM, Brooke Rhead <[email protected]> wrote:
> 
>> Hi Nimrod,
>>
>> Ah, sorry for misunderstanding what you are trying to do!
>> Unfortunately, the person here who has done the most work on the SNP
>> tracks and who could best answer your questions is not available for the
>> next several weeks, but we still may be able to point you in the right
>> direction.
>>
>> I should clarify that the snp130CodingDbSnp table was built using
>> annotations directly from dbSNP, so, while there is a description of how
>> we built it (located in src/hg/makeDb/doc/hg18.txt in the Genome Browser
>> source code), it is likely not what you are looking for.  We could point you
>> to the portion of the code that is used to generate the "UCSC's predicted
>> function relative to selected gene tracks" portion of the SNP details page,
>> if you think that would be useful to you.
>>
>> One major change to your process that I can suggest is to start with the
>> refSeqAli table rather than the refGene table to determine the mRNA
>> coordinate.  The refGene table is a gene prediction table created from
>> refSeqAli, and alignment information present in refSeqAli is lost in
>> refGene.  The refSeqAli table is in psl format (
>> http://genome.ucsc.edu/FAQ/FAQformat.html#format2), which retains all of
>> the alignment information, and will enable you to go from a genomic
>> coordinate to the correct mRNA coordinate.
>>
>>
>> --
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>>
>> On 06/12/10 01:25, nimrod rubinstein wrote:
>>
>>> thanks for the quick response,
>>>
>>> actually i am using snp130, but in my data i also have SNPs that do not
>>> exist in snp130. i guess what i am trying to do (explained in my last
>>> email) is similar to what was performed in order to build the
>>> snp130CodingDbSnp table. is there any description for that?
>>>
>>> thanks again,
>>> nimrod
>>>
>>>
>>>
>>> On Sat, Jun 12, 2010 at 3:10 AM, Brooke Rhead <[email protected]> wrote:
>>>
>>>> Hi Nimrod,
>>>>
>>>> The snp130 table contains dbSNP's annotations on each SNP's predicted
>>>> functional role (in the 'func' field), which includes whether the SNP is
>>>> coding-synonymous, coding-nonsynonymous, in a 5' or 3' UTR, in an intron,
>>>> just near a gene, etc.  (See the SNP 130 track description for a full
>>>> list.)  dbSNP uses RefSeq Genes to make these predictions.
>>>>
>>>> For determining the amino acid changes, I am happy to report that there
>>>> is a somewhat new table in the hg18 database that already has the exact
>>>> information you are looking to extract: snp130CodingDbSnp.
>>>>
>>>> This table is what the Genome Browser uses to display coding changes when
>>>> you click on a SNP and look at the details page.  For instance, if you
>>>> click on rs17852585 in the Genome Browser and scroll down, you will see:
>>>>
>>>> Coding annotations by dbSNP:
>>>> NM_000808: missense L (CTC) --> P (CCC)
>>>>
>>>> (Note that you can also see predicted coding changes for *any* gene or
>>>> gene prediction track by clicking "Go to SNPs (130) track controls" and
>>>> making selections in the "On details page, show function and coding
>>>> differences relative to..." boxes.  This information is not stored in any
>>>> table -- it is generated on the fly when you click on a SNP.)
>>>>
>>>> I think that between the snp130 table and the snp130CodingDbSnp table,
>>>> you should be able to find what you are looking for.  If you have any
>>>> further questions, please feel free to write back to [email protected].  And
>>>> thank you for searching the mailing list archives before asking your
>>>> question!
>>>>
>>>> --
>>>> Brooke Rhead
>>>> UCSC Genome Bioinformatics Group
>>>>
>>>>
>>>> On 06/11/10 05:40, nimrod rubinstein wrote:
>>>>
>>>>> hi,
>>>>>
>>>>> i have a list of SNPs and their locations on hg18. i'd like to
>>>>> use ucsc data to find out for each SNP whether it falls in a
>>>>> known gene and if so in which of the following regions:
>>>>> 5'utr/coding sequence/intron/3'utr. if it does fall inside the
>>>>> coding sequence i would additionally like to know whether
>>>>> it is a synonymous SNP or not, and if not what is the resulting
>>>>> amino acid.
>>>>>
>>>>> i read through the mailing archives and understood it's best to
>>>>> use refGene and refMrna for this task: for a given SNP coordinate i
>>>>> first check whether it falls inside any of refGene's transcription
>>>>> boundaries. if it does, i then determine in which region of the
>>>>> gene. if it falls inside one of the coding exons i then extract
>>>>> the relevant codon from refMrna - and here's where i'm stuck:
>>>>>
>>>>> according to the coordinates in refGene i might determine that
>>>>> the SNP is in e.g., the 5'utr but according to the coordinates in
>>>>> the CDS file it may turn out that it's actually in the coding
>>>>> sequence, and the other way around (plus other similar
>>>>> combinations of that problem concerning the 3'utr and intron
>>>>> regions).
>>>>>
>>>>> i understand that the genomic coordinates in refGene are the
>>>>> result of BLAT and those in the CDS file are local coordinates
>>>>> from NCBI. since the mapping of NCBI mRNAs to the genome is
>>>>> imperfect these location discrepancies occur.
>>>>>
>>>>> so, if my description is correct is there any solution to my
>>>>> problem? if i misunderstood or am doing something wrong i would
>>>>> greatly appreciate your corrections.
>>>>> thank you very much for your time and help
>>>>> Nimrod Rubinstein
>>>>> The Department of Cell Research and Immunology
>>>>> Tel Aviv University
>>>>> _______________________________________________
>>>>> Genome maillist  -  [email protected]
>>>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome