[pygr] BLAST commentary

C. Titus Brown Wed, 27 May 2009 21:24:54 -0700

Chris et al.,

ok, some more BLAST comments.  (Sorry for the e-mail deluge -- when
working with a new section of the pygr code, I kind of go into a fugue
state where I just note down new ideas for later expansion.)


--

First, I dislike this interface:

   >>> al = blastmap(None, queryDB=db)

in order to distinguish the database query from the sequence query.
How about making it

   >>> al = blastmap[some_sequence]

OR

   >>> al = blastmap(seq=some_sequence)

OR

   >>> al = blastmap(queryDB=db)

with the first two returning the same thing?

I find the conflation of __getitem__ and __call__ behavior a bit odd but
not unpleasant ;).  On the other hand I dislike *having* to specify a
parameter as 'None' in order to get it to pay attention to a different
parameter.

--

Second, the handling of self-matches when BLASTing within a single
database is odd.

a) The warning message that you get when doing

   blastmap = blast.BlastMapping(db)
   blastmap(None, queryDB=db)

doesn't show up when you pick a sequence out of 'db':

   seq = db['some_sequence']
   blastmap(seq)

...even though it would be consistent to have it there.

b) I think it's a bad idea to throw away self-matches anyway.  If it's a
choice between pygr internal consistency ("we don't put self-matches
into the NLMSA") and pygr consistency with BLAST ("BLAST returns the
match of seq with itself; why am I not seeing that?") I think we should
go with the latter.

c) Note the duplicates in the doctest output, below:

>>> al = blastmap(None, queryDB=db)
>>> for seq in db.values():
...    for (src, dest, edge) in al[seq].edges():
...       print repr(src), repr(dest)
gapped[0:40] ungapped[0:40] 
gapped[0:40] ungapped[0:40]                          ## DUP
gapped[44:74] ungapped[40:70]
gapped[44:74] ungapped[40:70]                          ## DUP
ungapped[0:40] gapped[0:40]
ungapped[0:40] gapped[0:40]                          ## DUP
ungapped[40:70] gapped[44:74]
ungapped[40:70] gapped[44:74]                          ## DUP

Those are there because adding 'ungapped' -> 'gapped' into an NLMSA
automatically adds 'gapped' -> 'ungapped', too (i.e. a->b adds b->a).
Because BLAST isn't a symmetric algorithm -- a matching b doesn't
necessarily mean b matches a -- this is sort of legitimate; the edge
info may in fact be different between the "DUP" rows above.  It's still
confusing, though.

However, I feel like this is inconsistent with the removal of the
self-matches: either we should include everything that BLAST returns, or
remove some of the obvious crap from what BLAST returns.  If I apply my
(b) criterion, I'd go with leaving it all in there and just writing good
docs.

Thoughts?

--

Third, I think the BlastxMapping interface is not yet so useful.  Right
now it returns a list of NLMSASlices, which are hard to use without a
reasonably complete understanding of NLMSAs.  (At least, it was tricky
for me to figure out.)  I understand *why* this is, but I came up with
an alternative.

(See the attached 'blastfoo.py' for fully functioning code.)

OK, let's set up a blastx:

   dna_db = seqdb.SequenceFileDB('data/hbb1_mouse.fa')
   dna_seq = dna_db['gi|171854975|dbj|AB364477.1|']
   prot_db = seqdb.SequenceFileDB('data/sp_all_hbb')
   rat_prot = prot_db['HBB2_RAT']

   map = blast.BlastxMapping(prot_db)

Now, retrieve the matches for the given DNA sequence:

   dna_matches = map[dna_seq]

As things are currently, dna_matches is a list of NLMSASlice objects.
If I want to retrieve matches to a particular protein (rat_prot), I need
to iteratively search for slices containing that protein:

   for slice in dna_matches:
       if rat_prot in slice:
           edges = slice.edges()
           print edges

If, however, I construct two NLMSAs, one an MSA containing the
translated frames mapped to the proteins, and the other an annotation
map containing dna mapped to translations, I can do the queries much
more pygr-ishly.  For example, to query by protein and retrieve the
aligned DNA sequence, I could do:

   for src, dest, edge in trans2prot[rat_prot].edges():
       print repr(dna2trans[dest].keys()[0])

or I could query by DNA sequence, and retrieve protein matches:

   for translated_frame in dna2trans[dna_seq]:
       edges = trans2prot[translated_frame].edges()
       for (translated_ival, match_ival, edge) in edges:
           print (translated_ival, match_ival)

Again, fully functioning code to do all of this (including build the
NLMSAs) is attached.

Does this seem like a reasonable interface to go with?  If so, we could
change the code to produce six translated frames per DNA sequence (->
annotation objects), map the protein matches onto *those* sequences,
and then return both sets of mappings.

Specifically, then, you would have something like this:

   map = blast.BlastxMapping(prot_db)
   (dna2trans, trans2prot) = map[dna_seq]

I think the idea is still a bit half-baked so some critical responses
would help me bake it further ;).  But it does strike me as being
pygr-ish in the sense of chaining graph queries.  And the current
interface for translated BLASTs is not good, IMO.

--

All right, that's it for the night...

cheers,
--titus

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"pygr-dev" group.
To post to this group, send email to pygr-dev@googlegroups.com
To unsubscribe from this group, send email to 
pygr-dev+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/pygr-dev?hl=en
-~----------~----~----~----~------~----~------~--~---

from pygr import blast, seqdb, cnestedlist

dna_db = seqdb.SequenceFileDB('data/hbb1_mouse.fa')
dna_seq = dna_db['gi|171854975|dbj|AB364477.1|']
prot_db = seqdb.SequenceFileDB('data/sp_all_hbb')
rat_prot = prot_db['HBB2_RAT']

map = blast.BlastxMapping(prot_db)
dna_matches = map[dna_seq]

###

#
# build an NLMSA containing all of the annotations mapped onto the
# proteins.  This is particularly useful because we can now query
# this NLMSA by the protein sequence.
#

trans2prot = cnestedlist.NLMSA('foo', mode='memory')

for slice in dna_matches:
    edges = slice.edges()
    for (src, dest, edge) in edges:
        trans2prot += src
        trans2prot[src] += dest
        
trans2prot.build()

###

#
# also build an NLMSA containing the translations (src -> seq).  We can
# query this by DNA sequence.
#

dna2trans = cnestedlist.NLMSA('foo2', mode='memory')

for slice in dna_matches:
    annot = slice.seq
    seq = annot.sequence

    dna2trans += seq
    dna2trans[seq] += annot

dna2trans.build()

###

# without any NLMSAs, you have to iterate over all of the results
# and check each one.
for slice in dna_matches:
    if rat_prot in slice:
        edges = slice.edges()
        print edges

# with both NLMSAs, you can query by protein and retrieve the aligned
# DNA sequence(s)
for src, dest, edge in trans2prot[rat_prot].edges():
    print repr(dna2trans[dest].keys()[0])

# ...or you can query by DNA sequence and then retrieve translations and
# translations-to-protein matches:
for translated_frame in dna2trans[dna_seq]:
    edges = trans2prot[translated_frame].edges()
    for (translated_ival, match_ival, edge) in edges:
        print (translated_ival, match_ival)

[pygr] BLAST commentary

Reply via email to