Here's your answer:
> Only alignments having a base identity level within 0.5% of the best
> [...] were kept.
This is poorly worded. It may even be an editing mistake.
I am guessing that "base identity level" should have said "score"
or something like that.
What it is supposed to mean is that coverage is also
included. If a single query sequence aligns at say eight places,
and has alignments with various scores like
1100, 990, 980, 990, 1097, 500, 320, 50, 1100
Remember the higher the coverage, the higher the score will be.
At 0.5% it will keep the two that are the best scoring 1100,
and it will keep anything greater than 99.5% of the topscore 1100,
so the cut-off score is 1100 * 0.995 = 1094.5.
So the 1097 alignment will also be kept.
The genbank alignment pipeline is definitely using pslCDnaFilter.
The refseq native mrna's are using these pslCDnaFilter parameters:
-minId=0.95 -minCover=0.25 -globalNearBest=0.0010 -minQSize=20
-minNonRepSize=16 -ignoreNs -bestOverlap -polyASizes
So, the globalNearBest here is really strict, only keeping
the top 99.9%.
On top of that, it must have 95% Identity along the aligning parts,
it must have at least 25% coverage of the query sequence,
queries must be at least 20bp long, Ns are ignored in calculating
score and coverage, and there are some other params
that I won't go into here.
Here are all the pslCDnaFilter settings that hg18 genbank alignments
are using:
# finished assemblies
var.finished.refseq.mrna.native.pslCDnaFilter = -minId=0.95
-minCover=0.25 ${globalNB.refseq.mrna.native.pslCDnaFilter}
var.finished.refseq.mrna.xeno.pslCDnaFilter = -minId=0.35
-minCover=0.25 ${globalNB.refseq.mrna.xeno.pslCDnaFilter}
var.finished.genbank.mrna.native.pslCDnaFilter = -minId=0.95
-minCover=0.25 ${globalNB.genbank.mrna.native.pslCDnaFilter}
var.finished.genbank.mrna.xeno.pslCDnaFilter = -minId=0.35
-minCover=0.25 ${globalNB.genbank.mrna.xeno.pslCDnaFilter}
var.finished.genbank.est.native.pslCDnaFilter = -minId=0.95
-minCover=0.25 ${globalNB.genbank.est.native.pslCDnaFilter}
var.finished.genbank.est.xeno.pslCDnaFilter = -minId=0.10
-minCover=0.10 ${globalNB.genbank.est.xeno.pslCDnaFilter}
# common paramters used for global near best filtering
var.globalNB.refseq.mrna.native.pslCDnaFilter = -globalNearBest=0.0010
${common.refseq.mrna.native.pslCDnaFilter}
var.globalNB.refseq.mrna.xeno.pslCDnaFilter = -globalNearBest=0.0100
${common.refseq.mrna.xeno.pslCDnaFilter}
var.globalNB.genbank.mrna.native.pslCDnaFilter = -globalNearBest=0.0025
${common.genbank.mrna.native.pslCDnaFilter}
var.globalNB.genbank.mrna.xeno.pslCDnaFilter = -globalNearBest=0.0100
${common.genbank.mrna.xeno.pslCDnaFilter}
var.globalNB.genbank.est.native.pslCDnaFilter = -globalNearBest=0.0025
${common.genbank.est.native.pslCDnaFilter}
var.globalNB.genbank.est.xeno.pslCDnaFilter = -globalNearBest=0.0100
${common.genbank.est.xeno.pslCDnaFilter}
# common paramters used for all assembly categories
var.common.refseq.mrna.native.pslCDnaFilter = -minQSize=20
-minNonRepSize=16 -ignoreNs -bestOverlap -polyASizes
var.common.refseq.mrna.xeno.pslCDnaFilter = -minQSize=20
-minNonRepSize=40 -ignoreNs -bestOverlap -polyASizes
var.common.genbank.mrna.native.pslCDnaFilter = -minQSize=20
-minNonRepSize=16 -ignoreNs -bestOverlap -polyASizes
var.common.genbank.mrna.xeno.pslCDnaFilter = -minQSize=20
-minNonRepSize=40 -ignoreNs -bestOverlap -polyASizes
var.common.genbank.est.native.pslCDnaFilter = -minQSize=20
-minNonRepSize=16 -ignoreNs -bestOverlap -polyASizes -usePolyTHead
var.common.genbank.est.xeno.pslCDnaFilter = -minQSize=20
-minNonRepSize=40 -ignoreNs -bestOverlap -polyASizes -usePolyTHead
-Galt
On Tue, 11 Nov 2008, Micha Sammeth wrote:
> Galt,
>
> as I mentioned in my first email, I am talking about Genbank mRNA
> "BM451627". If you go to the track description for mRNAs, you read:
>
> --
>
> Methods
>
> GenBank human mRNAs were aligned against the genome using the blat
> program. When a single mRNA aligned in multiple places, the alignment
> having the highest base identity was found. Only alignments having a
> base identity level within 0.5% of the best and at least 96% base
> identity with the genomic sequence were kept.
>
> --
>
> I dont need a manual for blat or postfilterings, I want to know what you
> are doing for the genome browser alignments -- and whether I interprete
> it right that you keep both alignments of an 1200nt mRNA if one is a
> 150nt subsequence with X% identity (over that 150nt) and the other
> alignment is a 1200nt alignment with also X% identity (over the 1200nt),
> supposed of course that X>.96.
>
> Ei, I seem to express myself very complicated.
>
> micha.
>
>
> En/na Galt Barber ha escrit:
>>
>> How does "further filtered" translate into "no control" ?
>> How does BLAT-output-behavior become "kept at UCSC" ?
>>
>> pslCDnaFilter for example can filter for coverage as well as many
>> other things.
>>
>> If you look in the kent/src/hg/makeDb/doc/ directory
>> you will seem plenty examples of what UCSC does with BLAT.
>>
>> BLAT can filter on %ID, it can filter on score.
>> You can use this to throw out really terrible
>> low-scoring alignments.
>>
>> But even just for normal use, people filter their BLAT output
>> for desired properties.
>>
>> Real mRNAs for example have very long introns between short exons.
>>
>> Take it to heart when I tell you that you need to post-filter psls.
>>
>> --
>>
>> Here's an example of help screen from pslReps (included in blatSrcXX.zip)
>>
>> [hgwdev:blatSrc34> pslReps
>> pslReps - analyse repeats and generate genome wide best
>> alignments from a sorted set of local alignments
>> usage:
>> pslReps in.psl out.psl out.psr
>> where in.psl is an alignment file generated by psLayout and
>> sorted by pslSort, out.psl is the best alignment output
>> and out.psr contains repeat info
>> options:
>> -nohead don't add PSL header
>> -ignoreSize Will not weigh in favor of larger alignments so much
>> -noIntrons Will not penalize for not having introns when calculating
>> size factor
>> -singleHit Takes single best hit, not splitting into parts
>> -minCover=0.N minimum coverage to output. Default is 0.
>> -ignoreNs Ignore 'N's when calculating minCover.
>> -minAli=0.N minimum alignment ratio
>> default is 0.93
>> -nearTop=0.N how much can deviate from top and be taken
>> default is 0.01
>> -minNearTopSize=N Minimum size of alignment that is near top
>> for alignment to be kept. Default 30.
>> -coverQSizes=file Tab-separate file with effective query sizes.
>> When used with -minCover, this allows polyAs
>> to be excluded from the coverage calculation
>>
>> pslCDnaFilter has many more options even than pslReps.
>>
>> --
>>
>> Here's an example sketch of BLAT use at UCSC.
>>
>> 1. mkdir someplace
>>
>> 2. faSplit query.fa and database.fa to make smaller chunks of e.g. genome
>>
>> 3. run gensub to create parasol batch-file
>> pairing every query chunk against every db chunk.
>>
>> 4. run BLAT cluster job using parasol, each job
>> just BLATs one chunk-pair.
>>
>> 5. Might need to use liftUp to correct coordinates for
>> earlier splitting step.
>>
>> 6. cat all the results together into a big psl file.
>>
>> 7 use pslCDnaFilter or pslReps to filter the psls
>> to have the desired characteristics.
>>
>> -Galt
>>
>>
>> On Mon, 10 Nov 2008, Micha Sammeth wrote:
>>
>>> Hi Galt,
>>>
>>> somehow we seem to miss each other in understanding. My question is
>>> exactly targeting at the in my eyes missing filtering of the blat
>>> output for multiple alignments of the same query with equal
>>> identities but strongly differing coverages of the query. Do I
>>> understand correctly that there is no control at all, and all
>>> alignments with equal identities (>96% resp. >maxIdentity-0.5% of the
>>> aligned stretch) are kept at UCSC?
>>>
>>> Thank you, micha.
>>>
>>> En/na Galt Barber ha escrit:
>>>>
>>>> yes, it is common that the blat output
>>>> is further filtered by some method.
>>>>
>>>> -Galt
>>>>
>>>>
>>>> On Mon, 10 Nov 2008, Micha Sammeth wrote:
>>>>
>>>>> Hi Galt,
>>>>>
>>>>> thank you for the quick answer, I prefer programs to scripts. Did I
>>>>> understand correctly that the sketched scenario of 10% of a trancript
>>>>> aligning with the same identity as 100% of a transcript and both
>>>>> alignments are kept occurs at UCSC? Is it frequent?
>>>>>
>>>>> Thank you, micha.
>>>>>
>>>>>
>>>>> En/na Galt Barber ha escrit:
>>>>>>
>>>>>> Percent Identity only applies regions aligned.
>>>>>> Gaps are not considered aligned regions of course.
>>>>>>
>>>>>> You are interested in the level of coverage.
>>>>>> You can use utilities like pslReps and pslCDnaFilter
>>>>>> to post-filter your BLAT psl results. You can also
>>>>>> just write your own script to filter it however you like.
>>>>>>
>>>>>> You might also try the BLAT options -maxGap=0 and -fastMap.
>>>>>>
>>>>>> -Galt
>>>>>>
>>>>>>
>>>>>> On Mon, 10 Nov 2008, Micha Sammeth wrote:
>>>>>>
>>>>>>> Hello helpdesk,
>>>>>>>
>>>>>>> I have again a curiosity concerning blat alignments. I consider
>>>>>>> the case
>>>>>>> BM451627. It was for some reasons not in the dataset of hg17, so I
>>>>>>> downloaded the sequence and ran a blat -stepSize=5 -minScore=0
>>>>>>> -minIdentity=0, which should correspond to the settings used at
>>>>>>> UCSC.
>>>>>>> Checking idendity, I find the highest score of about 20% sequence
>>>>>>> identity -- match/query length -- it also does not change much when
>>>>>>> additionally taking into account repmatch.
>>>>>>>
>>>>>>> However, I found BM451627 in the UCSC hg16 database, where it
>>>>>>> reports a
>>>>>>> ~98% identity match. Looking closer at the alignment, in hg16
>>>>>>> there is a
>>>>>>> ~150nt stretch from the 1244nt which aligns with 98% identity ---
>>>>>>> and
>>>>>>> probably a couple of bases that changed from hg16->hg17 are
>>>>>>> responsible
>>>>>>> that in this 150nt region sequence identity drops below the
>>>>>>> threshold of
>>>>>>> 96%.
>>>>>>>
>>>>>>> My question now is:
>>>>>>>
>>>>>>> Does this hold for all identities, say a transcript aligns with
>>>>>>> 1000 nt
>>>>>>> and 98% identity in one place and in another place with 100nt at 98%
>>>>>>> will be put in both places, regardless of the coverage of the
>>>>>>> transcript
>>>>>>> by the alignment? In other words, the identity criterion of 96%
>>>>>>> or 0.5%
>>>>>>> of the best alignment is applied to match/(Qend-Qstart)? And if
>>>>>>> so, what
>>>>>>> was the motivation to not take the "global identity" of the
>>>>>>> query, did
>>>>>>> you have bad experiences with transcripts that did not want to align
>>>>>>> that way?
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> micha.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Genome maillist - [email protected]
>>>>>>> http://www.soe.ucsc.edu/mailman/listinfo/genome
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> O o O o O o Dr. Michael Sammeth
>>>>> | O o | | O o | | O o | http://www.sammeth.net
>>>>> | | O | | | | O | GRIB| O | Phone: +34-933-160-166
>>>>> | o O | | o O | | o O | Fax: +34 933-969-983
>>>>> o O o O o O Dr. Aiguader 88, 08003 Barcelona
>>>>>
>>>>> _______________________________________________
>>>>> Genome maillist - [email protected]
>>>>> http://www.soe.ucsc.edu/mailman/listinfo/genome
>>>>>
>>>>
>>>
>>> --
>>> O o O o O o Dr. Michael Sammeth
>>> | O o | | O o | | O o | http://www.sammeth.net
>>> | | O | | | | O | GRIB| O | Phone: +34-933-160-166
>>> | o O | | o O | | o O | Fax: +34 933-969-983
>>> o O o O o O Dr. Aiguader 88, 08003 Barcelona
>>>
>>
>
> --
> O o O o O o Dr. Michael Sammeth
> | O o | | O o | | O o | http://www.sammeth.net
> | | O | | | | O | GRIB| O | Phone: +34-933-160-166
> | o O | | o O | | o O | Fax: +34 933-969-983
> o O o O o O Dr. Aiguader 88, 08003 Barcelona
>
> _______________________________________________
> Genome maillist - [email protected]
> http://www.soe.ucsc.edu/mailman/listinfo/genome
>
_______________________________________________
Genome maillist - [email protected]
http://www.soe.ucsc.edu/mailman/listinfo/genome