We have a way to check for both within HSP objects I believe.
1) gaps() is documented to return the number of gap characters within
the query/hit/total
2) seq_inds('gap', 'hit/query') returns the number of gap positions,
with the position repeated for every gap character I believe, so
getting this in scalar context should be similar to gaps(). However,
to reduce that down to just the gap positions (no repeats) use the
collapse flag: seq_inds('gap', 'hit/query', 1)
chris
On Mar 20, 2009, at 12:13 PM, Dan Bolser wrote:
Here is what the man from the NCBI said:
---------- Forwarded message ----------
From: Peter Cooper <[email protected]>
Date: 2009/3/20
Subject: Re: [blast-help] Fwd: Question about the definition of 'gaps'
in blast -m8 output...
To: [email protected]
Cc: [email protected]
Hello,
The number reported tin the -m 8 output is the number of gap
openings. This will only equal the number of gap characters if the
length of each gap is 1.
Peter
-------------------------------
Peter S. Cooper, Ph.D.
Public Services
The National Center for Biotechnology Information
301-435-5951
On Mar 19, 2009, at 12:04 PM, romiti wrote:
Begin forwarded message:
From: User Services Service Account <[email protected]>
Date: March 19, 2009 10:23:01 AM EDT
To: [email protected]
Subject: Question about the definition of 'gaps' in blast -m8
output...
Reply-To: User Services Service Account <[email protected]>
------------- Begin Forwarded Message -------------
Date: Wed, 18 Mar 2009 19:31:01 +0000
Subject: Question about the definition of 'gaps' in blast -m8
output...
From: Dan Bolser <[email protected]>
To: [email protected], [email protected], [email protected]
Hi,
I'm sure this question comes up again and again, but searching the
BioPerl
mailing list didn't turn up any answers (to the second question).
Basically
I want to manually merge HSP's into 'contigious hits', and I want
to look at
the effect of various parameters on an algorithm to do this. This
task
prompted me to run a 'sanity check' on the blast data that I had,
and I
found that this check fails to fulfil my expectation of the data.
This means
that either I don't understand the data or the results are buggy.
Can someone clarify the definition of the 'gaps' column in the
blast -m8
output format for me?
I thought that the column 'gaps' was basically the number of
columns in the
HSP that contains a gap character. To test this on my data, I
checked the
following equality:
GAPS + 2 =
LENGTH - abs(QUERY_END - QUERY_START) + LENGTH - abs(HIT_END -
HIT_START)
This says that the number of GAPS should be equal to the
difference between
the LENGTH of the alignment minus the distance between the START
and END
point on either the QUERY or the HIT (+2 for the 'off by one' error
introduced by the two END-START calculations).
e.g.
10-> MMMMMMMM**MMMM*M <-22
|||| || | | |
20-> MMMM**MMMMM*M*MM <-31
where MISMATCHES = 7, LENGTH = 16, QUERY_END - QUERY_START = 12,
and HIT_END
- HIT_START = 11. The formula gives:
7+2(9) = 16-12(4) + 16-11(5)
The formula is correct for 11,282 out of 12,745 HSPs in my dataset
(89%),
however it fails for 1,463 cases (11%). Each of these cases has a
value of
MISMATCHES smaller than calculated by the formula. The difference
is usually
1 or 2, but is seen to go as high as 96, and scales roughly
linearly with
the size of GAPS.
Did I misunderstand what the value of GAPS is supposed to mean?
How come it
does apparently mean what I thought for so much of the data?
Thanks very much for any help on the above.
Dan.
------------- End Forwarded Message -------------
_______________________________________________
Bioperl-l mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
_______________________________________________
BBB mailing list
[email protected]
http://www.bioinformatics.org/mailman/listinfo/bbb