Titus requested some usage examples, so here goes:
First, I'll summarize the new BlastMapping model that in pygr 0.8
replaces the old BlastDB model. Whereas BlastDB combined the
functions of a sequence database and blast search functionality,
BlastMapping separates the blast functionality completely from the
sequence database functionality. (BlastDB is still available for
backwards compatibility). Another class, MegablastMapping, provides
the exact same interface as BlastMapping, but performs repeat-masked
megablast nucleotide searches instead of standard BLAST.
BASIC USAGE: create a BlastMapping object for any seqDB that you want
to search using BLAST. This BlastMapping object acts like a Pygr graph
(i.e. dict) that takes any sequence object as a source node (key) and
maps it to homologous sequence intervals from seqDB as target nodes,
with edge information about each alignment, following the generic
pattern blastmap[src][target] = edgeInfo. You can use it either in
the mapping style:
blastmap = BlastMapping(seqDB) # create the mapping object
# myquery can be any Pygr sequence interval object...
for src,target,edge in blastmap[myquery].edges(): # run blast query
# print the aligned intervals or whatever you want to do...
print '%s %s\n%s %s\n' %(src,repr(src),target,repr(target))
Or you can use it as a callable, so you can pass search parameters
like expmax etc. This also lets you save many search results into one
NLMSA...
for myquery in lotsaQueries:
blastmap(myquery, expmax=1e-10, al=myNLMSA)
myNLMSA.build() # build alignment indexes when done saving results
Since all this uses NLMSA to store the results, any of the NLMSASlice
group-by operators can be applied as usual.
Just like BlastDB, BlastMapping determines which flavor of BLAST
(blastp or blastn) to run based on the sequence type (nucleotide vs.
protein) of the query and database. Currently, tblastn and blastx
won't work correctly, because of the translation issues outlined above.
The new proposal would make tblastn work exactly the same as blastp
and blastn (i.e. as in the example code above), but with one extra
feature: the aligned target intervals (which look like protein
sequence intervals, even though the target database is nucleotide)
would have a "sequence" attribute which yields the associated
nucleotide sequence interval from the target sequence database, e.g.
to print the target *nucleotide sequence coordinates* next to the
target *protein sequence string*, you would just change the code
trivially:
for src,target,edge in blastmap[myquery].edges():
print '%s %s\n%s %s %2.1f\n' %(src,repr(src),target,
repr(target.sequence),
100.*edge.pIdentity())
This example also prints the percent identity based of the *protein*
sequence of target vs. src.
Explanation: the target sequence interval is not an interval of seqDB
of course, because tblastn translates the subject nucleotide sequence
into a protein sequence. target represents that protein sequence
interval; you can slice it, extract its protein sequence string (by
str()) etc. in all the usual ways. But it is actually an Annotation
sequence object, bound to a real sequence object, namely its
associated interval of nucleotide sequence from seqDB, which as always
with an annotation, you can obtain by requesting the "sequence"
attribute. This takes advantage of the fact that in Pygr an
annotation is actually a subclass of sequence (SeqPath), so all the
sequence operation methods come along for free...
I'll address blastx / tblastx in a separate posting.
-- Chris
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"pygr-dev" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/pygr-dev?hl=en
-~----------~----~----~----~------~----~------~--~---