Here is what the man from the NCBI said:
---------- Forwarded message ---------- From: Peter Cooper <[email protected]> Date: 2009/3/20 Subject: Re: [blast-help] Fwd: Question about the definition of 'gaps' in blast -m8 output... To: [email protected] Cc: [email protected] Hello, The number reported tin the -m 8 output is the number of gap openings. This will only equal the number of gap characters if the length of each gap is 1. Peter ------------------------------- Peter S. Cooper, Ph.D. Public Services The National Center for Biotechnology Information 301-435-5951 On Mar 19, 2009, at 12:04 PM, romiti wrote: > > > Begin forwarded message: > >> From: User Services Service Account <[email protected]> >> Date: March 19, 2009 10:23:01 AM EDT >> To: [email protected] >> Subject: Question about the definition of 'gaps' in blast -m8 output... >> Reply-To: User Services Service Account <[email protected]> >> >> >> ------------- Begin Forwarded Message ------------- >> >> Date: Wed, 18 Mar 2009 19:31:01 +0000 >> Subject: Question about the definition of 'gaps' in blast -m8 output... >> From: Dan Bolser <[email protected]> >> To: [email protected], [email protected], >> [email protected] >> >> >> Hi, >> >> I'm sure this question comes up again and again, but searching the BioPerl >> mailing list didn't turn up any answers (to the second question). Basically >> I want to manually merge HSP's into 'contigious hits', and I want to look at >> the effect of various parameters on an algorithm to do this. This task >> prompted me to run a 'sanity check' on the blast data that I had, and I >> found that this check fails to fulfil my expectation of the data. This means >> that either I don't understand the data or the results are buggy. >> >> Can someone clarify the definition of the 'gaps' column in the blast -m8 >> output format for me? >> >> I thought that the column 'gaps' was basically the number of columns in the >> HSP that contains a gap character. To test this on my data, I checked the >> following equality: >> >> GAPS + 2 = >> LENGTH - abs(QUERY_END - QUERY_START) + LENGTH - abs(HIT_END - HIT_START) >> >> >> This says that the number of GAPS should be equal to the difference between >> the LENGTH of the alignment minus the distance between the START and END >> point on either the QUERY or the HIT (+2 for the 'off by one' error >> introduced by the two END-START calculations). >> >> e.g. >> >> 10-> MMMMMMMM**MMMM*M <-22 >> |||| || | | | >> 20-> MMMM**MMMMM*M*MM <-31 >> >> >> where MISMATCHES = 7, LENGTH = 16, QUERY_END - QUERY_START = 12, and HIT_END >> - HIT_START = 11. The formula gives: >> >> 7+2(9) = 16-12(4) + 16-11(5) >> >> >> The formula is correct for 11,282 out of 12,745 HSPs in my dataset (89%), >> however it fails for 1,463 cases (11%). Each of these cases has a value of >> MISMATCHES smaller than calculated by the formula. The difference is usually >> 1 or 2, but is seen to go as high as 96, and scales roughly linearly with >> the size of GAPS. >> >> >> Did I misunderstand what the value of GAPS is supposed to mean? How come it >> does apparently mean what I thought for so much of the data? >> >> >> Thanks very much for any help on the above. >> >> Dan. >> >> ------------- End Forwarded Message ------------- _______________________________________________ BBB mailing list [email protected] http://www.bioinformatics.org/mailman/listinfo/bbb
