[Genome] questions about parsing PSL file from BLAT

Xianjun Dong Tue, 21 Apr 2009 13:14:20 -0700

Hi,

I have two questions about parsing PSL files from BLAT:


1. How can I understand the percent ID and score calculations intuitively?

From the BLAT FAQ (http://genome.ucsc.edu/FAQ/FAQblat#blat4), I 
understand the formula (for DNA alignments)
ID = 100.0 - pslCalcMilliBad(psl, TRUE) * 0.1 
as
ID = 100.0 - 100 * (misMatch+qNumInsert)/(match+repMatch+misMatch)
   = 100 * (match+repMatch-qNumInsert) / (match+repMatch+misMatch).
Is that right?

If my understanding is correct, could you help me understand what this 
percent ID means in simple terms?
I tried to interpret it as the fraction of matched bases within the 
aligned part of the query, but that does not fit. From the PSL output 
file, I can see that the alignment length in the query sequence (L) 
satisfies
L = (qEnd-qStart) = match + repMatch + misMatch + nCount + qNumInsert.
"match+repMatch+misMatch" is the aligned part of the full L. But what 
does "match+repMatch-qNumInsert" represent?

I have the same question about the score, which as I understand it is
{match + int(repMatch/2) - misMatch - qNumInsert - tNumInsert}.
What does this mean?
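
To make the two formulas above concrete, here is a minimal Python sketch 
that evaluates them exactly as written in this message. The Psl record 
and the function names are hypothetical and not part of any UCSC tool. 
The percent ID follows the simplified expression above, not the full 
pslCalcMilliBad C code from the FAQ, so values reported by BLAT tools may 
differ.

```python
from dataclasses import dataclass


@dataclass
class Psl:
    """Hypothetical container for the PSL counts used in this message."""
    match: int
    misMatch: int
    repMatch: int
    qNumInsert: int
    tNumInsert: int


def percent_id(psl: Psl) -> float:
    """Percent ID using the simplified DNA formula written above."""
    aligned = psl.match + psl.repMatch + psl.misMatch
    if aligned == 0:
        # Assumption: report 0% for an empty alignment instead of dividing by zero.
        return 0.0
    return 100.0 - 100.0 * (psl.misMatch + psl.qNumInsert) / aligned


def score(psl: Psl) -> int:
    """Score as written above: match + int(repMatch/2) - misMatch - inserts."""
    return (psl.match + psl.repMatch // 2
            - psl.misMatch - psl.qNumInsert - psl.tNumInsert)


# Made-up hit: 95 matches, 3 mismatches, 2 repeat matches, 1 query insert.
hit = Psl(match=95, misMatch=3, repMatch=2, qNumInsert=1, tNumInsert=0)
print(percent_id(hit), score(hit))
```

For this made-up hit, the aligned part is 100 bases, of which 4 are 
penalized (3 mismatches plus 1 query insert), and the score is 
95 + 1 - 3 - 1 - 0.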


2. What would be a good threshold for filtering hits from multiple queries?

Percent ID and score are two criteria for assessing a BLAT hit, but I 
find it hard to use either one alone to define a threshold for filtering 
short hits from multiple queries. Obviously, 95% ID (for example) alone 
is not sufficient, since some hits are very short but 100% matched. 
The score, on the other hand, is an absolute value for each query, so it 
is not possible to define a common score cutoff for all queries (which 
differ in length). I am wondering whether we could use, say,
score / querySize
with a common threshold (for example 95%) to filter out hits with small 
scores. I was also thinking of using the highest score for each query as 
a reference and filtering out hits whose score falls too far below it. 
For example, all hits with a score more than 60% lower than the highest 
score for that query would be removed.
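
The two filtering ideas above can be sketched in Python as follows. This 
is only an illustration under stated assumptions: hits are plain dicts 
with hypothetical keys qName, qSize and score, the function names are 
made up, and both cutoffs (0.95 and 0.6) are examples, not recommended 
values. "60% lower than the highest score" is read here as "below 40% of 
the best score for that query".

```python
def filter_by_normalized_score(hits, min_fraction=0.95):
    """Keep hits whose score / qSize is at least min_fraction."""
    return [h for h in hits if h["score"] / h["qSize"] >= min_fraction]


def filter_by_best_score(hits, max_drop=0.6):
    """Keep hits scoring no more than max_drop below the best hit of the same query.

    Assumes the best score of every query is positive.
    """
    best = {}
    for h in hits:
        name = h["qName"]
        if name not in best or h["score"] > best[name]:
            best[name] = h["score"]
    return [h for h in hits if h["score"] >= best[h["qName"]] * (1.0 - max_drop)]


# Made-up hits for two queries of different length.
hits = [
    {"qName": "q1", "qSize": 100, "score": 98},
    {"qName": "q1", "qSize": 100, "score": 30},
    {"qName": "q2", "qSize": 50, "score": 49},
    {"qName": "q2", "qSize": 50, "score": 25},
]
print(filter_by_normalized_score(hits))
print(filter_by_best_score(hits))
```

With these made-up hits, the normalized filter keeps only the best hit 
of each query, while the relative filter also keeps the q2 hit with 
score 25, because it is still above 40% of that query's best score.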

Does anyone have experience with this problem?

Thanks in advance

Xianjun
_______________________________________________
Genome maillist  -  Genome@soe.ucsc.edu
https://lists.soe.ucsc.edu/mailman/listinfo/genome
