Galt,

as I mentioned in my first email, I am talking about GenBank mRNA "BM451627". The track description for mRNAs reads:

--
Methods

GenBank human mRNAs were aligned against the genome using the blat program. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept.
--

I don't need a manual for blat or post-filtering. I want to know what you are doing for the genome browser alignments, and whether I interpret it correctly that you keep both alignments of a 1200nt mRNA if one is a 150nt subsequence with X% identity (over those 150nt) and the other is a 1200nt alignment also with X% identity (over the 1200nt), provided of course that X > 96.

Well, I seem to be expressing myself in a very complicated way.

micha.

Galt Barber wrote:
>
> How does "further filtered" translate into "no control" ?
> How does BLAT-output-behavior become "kept at UCSC" ?
>
> pslCDnaFilter for example can filter for coverage as well as many
> other things.
>
> If you look in the kent/src/hg/makeDb/doc/ directory
> you will see plenty of examples of what UCSC does with BLAT.
>
> BLAT can filter on %ID, it can filter on score.
> You can use this to throw out really terrible
> low-scoring alignments.
>
> But even just for normal use, people filter their BLAT output
> for desired properties.
>
> Real mRNAs for example have very long introns between short exons.
>
> Take it to heart when I tell you that you need to post-filter psls.
>
> --
>
> Here's an example of the help screen from pslReps (included in blatSrcXX.zip)
>
> [hgwdev:blatSrc34> pslReps
> pslReps - analyse repeats and generate genome wide best
> alignments from a sorted set of local alignments
> usage:
>     pslReps in.psl out.psl out.psr
> where in.psl is an alignment file generated by psLayout and
> sorted by pslSort, out.psl is the best alignment output
> and out.psr contains repeat info
> options:
>     -nohead            don't add PSL header
>     -ignoreSize        Will not weigh in favor of larger alignments so much
>     -noIntrons         Will not penalize for not having introns when calculating
>                        size factor
>     -singleHit         Takes single best hit, not splitting into parts
>     -minCover=0.N      minimum coverage to output. Default is 0.
>     -ignoreNs          Ignore 'N's when calculating minCover.
>     -minAli=0.N        minimum alignment ratio
>                        default is 0.93
>     -nearTop=0.N       how much can deviate from top and be taken
>                        default is 0.01
>     -minNearTopSize=N  Minimum size of alignment that is near top
>                        for alignment to be kept. Default 30.
>     -coverQSizes=file  Tab-separate file with effective query sizes.
>                        When used with -minCover, this allows polyAs
>                        to be excluded from the coverage calculation
>
> pslCDnaFilter has many more options even than pslReps.
>
> --
>
> Here's an example sketch of BLAT use at UCSC.
>
> 1. mkdir someplace
>
> 2. faSplit query.fa and database.fa to make smaller chunks of e.g. genome
>
> 3. run gensub to create parasol batch-file
>    pairing every query chunk against every db chunk.
>
> 4. run BLAT cluster job using parasol, each job
>    just BLATs one chunk-pair.
>
> 5. Might need to use liftUp to correct coordinates for
>    earlier splitting step.
>
> 6. cat all the results together into a big psl file.
>
> 7. use pslCDnaFilter or pslReps to filter the psls
>    to have the desired characteristics.
>
> -Galt
>
>
> On Mon, 10 Nov 2008, Micha Sammeth wrote:
>
>> Hi Galt,
>>
>> somehow we seem to miss each other in understanding.
>> My question is exactly targeting the, in my eyes, missing filtering of
>> the blat output for multiple alignments of the same query with equal
>> identities but strongly differing coverages of the query. Do I
>> understand correctly that there is no control at all, and all
>> alignments with equal identities (>96% and >maxIdentity-0.5%, over the
>> aligned stretch) are kept at UCSC?
>>
>> Thank you, micha.
>>
>> Galt Barber wrote:
>>>
>>> yes, it is common that the blat output
>>> is further filtered by some method.
>>>
>>> -Galt
>>>
>>>
>>> On Mon, 10 Nov 2008, Micha Sammeth wrote:
>>>
>>>> Hi Galt,
>>>>
>>>> thank you for the quick answer, I prefer programs to scripts. Did I
>>>> understand correctly that the sketched scenario, in which 10% of a
>>>> transcript aligns with the same identity as 100% of the transcript
>>>> and both alignments are kept, occurs at UCSC? Is it frequent?
>>>>
>>>> Thank you, micha.
>>>>
>>>>
>>>> Galt Barber wrote:
>>>>>
>>>>> Percent Identity only applies to regions aligned.
>>>>> Gaps are not considered aligned regions of course.
>>>>>
>>>>> You are interested in the level of coverage.
>>>>> You can use utilities like pslReps and pslCDnaFilter
>>>>> to post-filter your BLAT psl results. You can also
>>>>> just write your own script to filter it however you like.
>>>>>
>>>>> You might also try the BLAT options -maxGap=0 and -fastMap.
>>>>>
>>>>> -Galt
>>>>>
>>>>>
>>>>> On Mon, 10 Nov 2008, Micha Sammeth wrote:
>>>>>
>>>>>> Hello helpdesk,
>>>>>>
>>>>>> I have another question concerning blat alignments. I consider
>>>>>> the case BM451627. For some reason it was not in the dataset of
>>>>>> hg17, so I downloaded the sequence and ran blat with -stepSize=5
>>>>>> -minScore=0 -minIdentity=0, which should correspond to the
>>>>>> settings used at UCSC.
>>>>>> Checking identity, I find a highest score of about 20% sequence
>>>>>> identity (match / query length); it also does not change much
>>>>>> when additionally taking repmatch into account.
>>>>>>
>>>>>> However, I found BM451627 in the UCSC hg16 database, where it
>>>>>> reports a ~98% identity match. Looking closer at the alignment,
>>>>>> in hg16 there is a ~150nt stretch of the 1244nt which aligns with
>>>>>> 98% identity, and probably a couple of bases that changed from
>>>>>> hg16->hg17 are responsible for the sequence identity in this
>>>>>> 150nt region dropping below the threshold of 96%.
>>>>>>
>>>>>> My question now is:
>>>>>>
>>>>>> Does this hold for all identities? Say a transcript aligns with
>>>>>> 1000nt at 98% identity in one place and with 100nt at 98% in
>>>>>> another place: will it be put in both places, regardless of the
>>>>>> coverage of the transcript by the alignment? In other words, is
>>>>>> the identity criterion of 96% or 0.5% of the best alignment
>>>>>> applied to match/(Qend-Qstart)? And if so, what was the
>>>>>> motivation not to take the "global identity" of the query? Did
>>>>>> you have bad experiences with transcripts that did not want to
>>>>>> align that way?
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> micha.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Genome maillist - [email protected]
>>>>>> http://www.soe.ucsc.edu/mailman/listinfo/genome
>>>>>>
>>>>>
>>>>
>>>> --
>>>> O o O o O o Dr. Michael Sammeth
>>>> | O o | | O o | | O o | http://www.sammeth.net
>>>> | | O | | | | O | GRIB| O | Phone: +34-933-160-166
>>>> | o O | | o O | | o O | Fax: +34 933-969-983
>>>> o O o O o O Dr. Aiguader 88, 08003 Barcelona
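
For readers of this thread, the distinction between identity over the aligned stretch and coverage of the query can be made concrete with a small Python sketch. This is not UCSC's code: the Alignment record, its field names, and the 0.5 coverage cutoff are assumptions for illustration, and identity is simplified to matches / (matches + mismatches). The filter applies the rule quoted from the track description (at least 96% identity and within 0.5% of the best) and, optionally, a minimum coverage in the spirit of the -minCover option shown in the pslReps help.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical, simplified stand-in for one PSL row (not the real format)."""
    name: str
    matches: int      # matching bases within the aligned blocks
    mismatches: int   # mismatching bases within the aligned blocks
    q_size: int       # full length of the query (the mRNA)

    @property
    def identity(self) -> float:
        # Identity over aligned bases only; unaligned query bases are ignored.
        aligned = self.matches + self.mismatches
        return self.matches / aligned if aligned else 0.0

    @property
    def coverage(self) -> float:
        # Fraction of the whole query that takes part in the alignment.
        return (self.matches + self.mismatches) / self.q_size


def near_best(alns, min_identity=0.96, near_top=0.005, min_cover=0.0):
    """Keep alignments with identity >= min_identity and within near_top
    of the best identity; min_cover > 0 additionally requires coverage."""
    if not alns:
        return []
    best = max(a.identity for a in alns)
    return [a for a in alns
            if a.identity >= min_identity
            and a.identity >= best - near_top
            and a.coverage >= min_cover]


# The scenario from the thread: a 1200nt mRNA aligning in full at 98%
# and, elsewhere, with only a 150nt stretch at 98%.
full = Alignment("full", matches=1176, mismatches=24, q_size=1200)
part = Alignment("part", matches=147, mismatches=3, q_size=1200)

print([a.name for a in near_best([full, part])])                 # ['full', 'part']
print([a.name for a in near_best([full, part], min_cover=0.5)])  # ['full']
```

With an identity-only criterion both placements survive, because the 150nt hit is judged only on the bases it aligns. Adding a coverage requirement is one way to drop it; whether and where the browser pipeline does this is exactly the open question in the messages above.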
