Hi, I have two questions about parsing PSL files from BLAT:
1. How can I understand the percent ID and score calculations intuitively?

From the BLAT FAQ (http://genome.ucsc.edu/FAQ/FAQblat#blat4), I understand the formula for DNA alignments,

    ID = 100.0 - pslCalcMilliBad(psl, TRUE) * 0.1

as

    ID = 100.0 - 100 * (misMatch + qNumInsert) / (match + repMatch + misMatch)
       = 100 * (match + repMatch - qNumInsert) / (match + repMatch + misMatch)

Is that right? If my understanding is correct, could you help me understand what this percent ID means in simple terms? I tried to read it as the coverage of matched bases relative to the aligned part of the query, but that is not what it is. From the PSL output file, I can see that the alignment length in the query sequence (L) satisfies

    L = qEnd - qStart = match + repMatch + misMatch + nCount + qNumInsert

so "match + repMatch + misMatch" is the aligned part of the full L. But what does "match + repMatch - qNumInsert" represent?

I have the same question about the score, which as I understand it is

    score = match + int(repMatch / 2) - misMatch - qNumInsert - tNumInsert

What does this mean?

2. What would be a better threshold for filtering hits from multiple queries?

Percent ID and score are the two criteria for assessing a BLAT hit, but I find it hard to use either one alone to define a threshold for filtering short hits from multiple queries. A 95% ID cutoff (for example) is clearly not sufficient on its own, since some hits are very short but 100% matched. The score, on the other hand, is an absolute value for each query, so it is not possible to define one common score cutoff for all queries, which differ in length. I am considering using, say, score / querySize to define a common threshold (for example 95%) to filter out hits with small scores. I was also thinking of using the highest score for each query as a reference and removing hits whose score drops by more than some threshold, for example all hits with a score 60% lower than that query's highest score.

Does anyone have experience with this problem?
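To make both questions concrete, here is a minimal Python sketch of how I currently compute these values and of the two filters I am considering. The Psl tuple, the function names, and the default thresholds are only for illustration. The percent ID function implements just the simplified formula as I wrote it above, not the full pslCalcMilliBad from the FAQ, so it may not reproduce the FAQ's numbers exactly.

```python
from collections import namedtuple

# Only the PSL columns used below; field names follow the PSL column headers.
Psl = namedtuple(
    "Psl", "match misMatch repMatch nCount qNumInsert tNumInsert qName qSize"
)


def percent_id_simplified(p):
    """Simplified percent ID: 100 - 100 * (misMatch + qNumInsert) / aligned.

    Assumption: this is my reading of the formula, not the FAQ's exact code.
    """
    aligned = p.match + p.repMatch + p.misMatch
    if aligned == 0:
        return 0.0
    return 100.0 - 100.0 * (p.misMatch + p.qNumInsert) / aligned


def score(p):
    """Score as I understand it for DNA alignments."""
    return p.match + p.repMatch // 2 - p.misMatch - p.qNumInsert - p.tNumInsert


def filter_by_query_fraction(hits, min_frac=0.95):
    """Keep hits whose score is at least min_frac of the query length."""
    return [h for h in hits if score(h) >= min_frac * h.qSize]


def filter_by_best_score(hits, max_drop=0.6):
    """Keep hits whose score is within max_drop of the best score per query."""
    best = {}
    for h in hits:
        s = score(h)
        if h.qName not in best or s > best[h.qName]:
            best[h.qName] = s
    return [h for h in hits if score(h) >= (1.0 - max_drop) * best[h.qName]]
```

With these definitions, a short hit that is 100% identical is still removed by either filter when a much longer, slightly imperfect hit exists for the same query, which is the behaviour I am after.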
Thanks in advance,
Xianjun