Hi Nimrod,

The psl format contains all of the information about an alignment between two sequences. Even a small gap on either side (the target or the query sequence) starts a new block in a psl record, so blocks do not necessarily correspond to exons, and a single exon can span several blocks.
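
As a rough illustration of how the block fields can take you from a genomic coordinate to an mRNA coordinate, here is a minimal Python sketch. It is not code from the Genome Browser source. It assumes an untranslated (nucleotide) psl record whose strand, qSize, blockSizes, qStarts and tStarts fields have already been read from the table, 0-based positions, and the convention described in the psl FAQ that qStarts are given on the reverse-complemented query when the strand is '-'. The function names and the example alignment are made up for illustration.

```python
from typing import List, Optional


def parse_starts(field: str) -> List[int]:
    """Parse a comma-separated psl list field such as '0,50,' into integers."""
    return [int(x) for x in field.split(",") if x]


def genomic_to_mrna(pos: int, strand: str, q_size: int,
                    block_sizes: List[int], q_starts: List[int],
                    t_starts: List[int]) -> Optional[int]:
    """Map a 0-based genomic position to a 0-based mRNA position.

    Returns None when the position is not covered by any alignment block
    (an intron, or a small gap that splits one exon into two blocks).
    """
    for size, q_start, t_start in zip(block_sizes, q_starts, t_starts):
        if t_start <= pos < t_start + size:
            q_pos = q_start + (pos - t_start)
            if strand == "-":
                # Assumption: qStarts are on the reverse-complemented query,
                # so flip back to the orientation of the mRNA record.
                q_pos = q_size - 1 - q_pos
            return q_pos
    return None


# Made-up alignment: a 100-base mRNA aligned in two 50-base blocks that are
# separated by a 3-base gap on the genomic side (positions 1050-1052).
sizes = parse_starts("50,50,")
q_starts = parse_starts("0,50,")
t_starts = parse_starts("1000,1053,")

print(genomic_to_mrna(1010, "+", 100, sizes, q_starts, t_starts))  # 10
print(genomic_to_mrna(1051, "+", 100, sizes, q_starts, t_starts))  # None
print(genomic_to_mrna(1010, "-", 100, sizes, q_starts, t_starts))  # 89
```

In the made-up example the two blocks would most likely belong to the same exon, since only three genomic bases separate them. That is the situation described above, where block boundaries and exon boundaries differ.
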

I'm not sure if you saw this mailing list response previously, but it looks like it might be particularly useful:
https://lists.soe.ucsc.edu/pipermail/genome/2009-July/019496.html

Also, we recently came across the "SeattleSeq Annotation" site, which might already have the exact tools you are looking for:
http://gvs.gs.washington.edu/SeattleSeqAnnotation/index.jsp

I hope this helps!

--
Brooke Rhead
UCSC Genome Bioinformatics Group

On 06/17/10 06:26, nimrod rubinstein wrote:
> hi brooke,
>
> thanks for the clarification.
>
> just one question to make sure i fully understand the structure of the
> refSeqAli.txt file: there are fields for describing the alignment blocks. is
> it always the rule that each block is an exon? or do blocks simply denote
> regions that align to the genome above the preselected threshold - and in
> that case a certain exon may actually span several alignment blocks?
>
> thanks,
> nimrod
>
> On Tue, Jun 15, 2010 at 3:35 AM, Brooke Rhead <[email protected]> wrote:
>
>> Hi Nimrod,
>>
>> Ah, sorry for misunderstanding what you are trying to do!
>> Unfortunately, the person here who has done the most work on the SNP
>> tracks and who could best answer your questions is not available for the
>> next several weeks, but we still may be able to point you in the right
>> direction.
>>
>> I should clarify that the snp130CodingDbSnp table was built using
>> annotations directly from dbSNP, so, while there is a description of how
>> we built it (located in src/hg/makeDb/doc/hg18.txt in the Genome Browser
>> source code), it is likely not what you are looking for. We could point you
>> to the portion of the code that is used to generate the "UCSC's predicted
>> function relative to selected gene tracks" portion of the SNP details page,
>> if you think that would be useful to you.
>>
>> One major change to your process that I can suggest is to start with the
>> refSeqAli table rather than the refGene table to determine the mRNA
>> coordinate. The refGene table is a gene prediction table created from
>> refSeqAli, and alignment information present in refSeqAli is lost in
>> refGene. The refSeqAli table is in psl format
>> (http://genome.ucsc.edu/FAQ/FAQformat.html#format2), which retains all of
>> the alignment information, and will enable you to go from a genomic
>> coordinate to the correct mRNA coordinate.
>>
>> --
>> Brooke Rhead
>> UCSC Genome Bioinformatics Group
>>
>> On 06/12/10 01:25, nimrod rubinstein wrote:
>>
>>> thanks for the quick response,
>>>
>>> actually i am using snp130, but in my data i also have SNPs that do not
>>> exist in snp130. i guess what i am trying to do (explained in my last
>>> email) is similar to what was performed in order to build the
>>> snp130CodingDbSnp. is there any description for that?
>>>
>>> thanks again,
>>> nimrod
>>>
>>> On Sat, Jun 12, 2010 at 3:10 AM, Brooke Rhead <[email protected]> wrote:
>>>
>>>> Hi Nimrod,
>>>>
>>>> The snp130 table contains dbSNP's annotations on each SNP's predicted
>>>> functional role (in the 'func' field), which includes whether the SNP is
>>>> coding-synonymous, coding-nonsynonymous, in a 5' or 3' UTR, in an intron,
>>>> just near a gene, etc. (See the SNP 130 track description for a full
>>>> list). dbSNP uses RefSeq Genes to make these predictions.
>>>>
>>>> For determining the amino acid changes, I am happy to report that there
>>>> is a somewhat new table in the hg18 database that already has the exact
>>>> information you are looking to extract: snp130CodingDbSnp.
>>>>
>>>> This table is what the Genome Browser uses to display coding changes when
>>>> you click on a SNP and look at the details page. For instance, if you
>>>> click on rs17852585 in the Genome Browser and scroll down, you will see:
>>>>
>>>>   Coding annotations by dbSNP:
>>>>   NM_000808: missense L (CTC) --> P (CCC)
>>>>
>>>> (Note that you can also see predicted coding changes for *any* gene or
>>>> gene prediction track by clicking "Go to SNPs (130) track controls" and
>>>> making selections in the "On details page, show function and coding
>>>> differences relative to..." boxes. This information is not stored in any
>>>> table -- it is generated on the fly when you click on a SNP.)
>>>>
>>>> I think that between the snp130 table and the snp130CodingDbSnp table,
>>>> you should be able to find what you are looking for. If you have any
>>>> further questions, please feel free to write back to [email protected].
>>>> And thank you for searching the mailing list archives before asking your
>>>> question!
>>>>
>>>> --
>>>> Brooke Rhead
>>>> UCSC Genome Bioinformatics Group
>>>>
>>>> On 06/11/10 05:40, nimrod rubinstein wrote:
>>>>
>>>>> hi,
>>>>>
>>>>> i have a list of SNPs and their locations on hg18. i'd like to
>>>>> use ucsc data to find out for each SNP whether it falls in a
>>>>> known gene and if so in which of the following regions:
>>>>> 5'utr/coding sequence/intron/3'utr. if it does fall inside the
>>>>> coding sequence i would additionally like to know whether
>>>>> it is a synonymous SNP or not, and if not what is the resulting
>>>>> amino acid
>>>>>
>>>>> i read through the mailing archives and understood it's best to
>>>>> use refGene and refMrna for this task: for a given SNP coordinate
>>>>> i first check whether it falls inside any of refGene's transcription
>>>>> boundaries. if it does, i then determine in which region of the
>>>>> gene. if it falls inside one of the coding exons i then extract
>>>>> the relevant codon from refMrna - and here's where i'm stuck:
>>>>>
>>>>> according to the coordinates in refGene i might determine that
>>>>> the SNP is in e.g., the 5'utr but according to the coordinates in
>>>>> the CDS file it may turn out that it's actually in the coding
>>>>> sequence, and the other way around (plus other similar
>>>>> combinations of that problem concerning the 3'utr and intron
>>>>> regions).
>>>>>
>>>>> i understand that the genomic coordinates in refGene are the
>>>>> result of BLAT and those in the CDS file are local coordinates
>>>>> from NCBI. since the mapping of NCBI mRNAs to the genome is
>>>>> imperfect these location discrepancies occur.
>>>>>
>>>>> so, if my description is correct is there any solution to my
>>>>> problem? if i misunderstood or am doing something wrong i would
>>>>> greatly appreciate your corrections.
>>>>>
>>>>> thank you very much for your time and help
>>>>> Nimrod Rubinstein
>>>>> The Department of Cell Research and Immunology
>>>>> Tel Aviv University
>>>>> _______________________________________________
>>>>> Genome maillist - [email protected]
>>>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
