Hi Kenny,
I think I can explain this.
- It sounds like you are opening the same sequence database file
twice, creating two different BlastDB objects. Pygr will not consider
these "equal" (either by cmp() or by dict hash index), so you get the
KeyError you reported.
- This may seem counterintuitive, but it's a data integrity guarantee
-- unless they are the same object it is quite challenging to
guarantee that they are "the same data". For example, for a BlastDB,
just comparing the filenames is not adequate; that can give both false
positives and false negatives. An identical filename string doesn't
mean the data are equal; the data could have changed on disk between
the two times the file was opened. Conversely, two different filename
strings might actually resolve to the same path. For that matter,
completely different filenames might point to two files that are exact
copies of each other. To do the check right, you'd have to recompute a
hash on the entire dataset.
- Pygr's solution to this problem is pygr.Data: assign a single
unique ID to each resource, and always obtain the resource by
requesting that ID. pygr.Data guarantees that within an interpreter
session two different requests for the same ID will resolve to the
same Python object.
- To take advantage of this, you just need to be consistent in always
identifying a given resource by its pygr.Data ID. In your example,
I suspect you did something like the following: 1) You opened the file
by constructing a BlastDB directly, then used it to generate some blast
results, which you then pickled. This works, but that BlastDB has no
pygr.Data ID, so there's no way for pygr.Data to apply the above data
integrity guarantee to it. 2) You then opened the file again (either
by constructing a BlastDB again, or by getting it from pygr.Data) and
tried to look up results from one in the other (presumably through the
intermediary of an NLMSA). This gave a KeyError.
- This reflects a fundamental problem. Imagine all the analyses being
done on data all over the world as a huge "web" of cross-references.
Many of those analyses are actually being done on the same dataset;
you can think of that dataset as a "node" that ties them together.
But ordinarily it would be very hard to run queries that traverse all
these analyses, because the "metadata" that show which data are
actually the *same data* are not available. So the "web of data"
fragments into single, disconnected analyses, and a computer program
cannot run a query that traverses more than one analysis at a time.
pygr.Data tries to solve this basic problem by capturing the missing
metadata of "identifiers" and "schema".
- You can always check whether two database objects share the same
pygr.Data ID; just look at their _persistent_id attributes. If the
attribute is missing, the object did not come from pygr.Data.
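
As a rough sketch of the points above, in plain Python rather than pygr code: FileDB, ResourceRegistry, content_digest and share_resource_id are hypothetical names made up for illustration, and this is not how pygr.Data is actually implemented. It only shows why two separately opened objects are not equal, why a whole-file hash is the only check that compares the data itself, and how requesting a resource by ID can return the same object each time. The only thing taken from pygr is the _persistent_id attribute name.

```python
import hashlib
import os
import tempfile


class FileDB(object):
    """Hypothetical stand-in for a file-backed database; not a pygr class."""

    def __init__(self, filename):
        self.filename = filename


def content_digest(filename, chunk_size=65536):
    """Hash the whole file: the only check that compares the actual data."""
    digest = hashlib.sha256()
    with open(filename, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ResourceRegistry(object):
    """Hypothetical ID -> object cache; an assumption, not pygr.Data itself."""

    def __init__(self):
        self._loaders = {}
        self._cache = {}

    def register(self, resource_id, loader):
        self._loaders[resource_id] = loader

    def get_resource(self, resource_id):
        # Build the object once, then hand back that same object every time.
        if resource_id not in self._cache:
            obj = self._loaders[resource_id]()
            obj._persistent_id = resource_id
            self._cache[resource_id] = obj
        return self._cache[resource_id]


def share_resource_id(a, b):
    """True only if both objects carry the same _persistent_id."""
    id_a = getattr(a, "_persistent_id", None)
    id_b = getattr(b, "_persistent_id", None)
    return id_a is not None and id_a == id_b


# Two files with different names but identical contents.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "genome.fa")
copy_path = os.path.join(tmpdir, "copy.fa")
for name in (path, copy_path):
    with open(name, "w") as handle:
        handle.write(">chr7\nACGTACGT\n")

# Opening the same file twice gives two objects that do not compare equal,
# because default equality is object identity.
db1 = FileDB(path)
db2 = FileDB(path)
opened_twice_equal = (db1 == db2)

# Different filenames, same data: only the content hash detects this.
copies_match = content_digest(path) == content_digest(copy_path)

# Requesting by ID twice returns the very same object.
registry = ResourceRegistry()
registry.register("Bio.Seq.Genome.YEAST.sacCer", lambda: FileDB(path))
g1 = registry.get_resource("Bio.Seq.Genome.YEAST.sacCer")
g2 = registry.get_resource("Bio.Seq.Genome.YEAST.sacCer")
```

Here g1 and g2 are one object, so anything keyed on g1 can be looked up with g2, whereas db1 and db2 cannot stand in for each other even though they point at the same file.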
Specific comments below.
I hope this helps...
-- Chris
On Jan 28, 2009, at 3:04 AM, Kenny Daily wrote:
>
> I am having trouble looking up sequence intervals in NLMSAs, using
> pygr 0.7.1. I am creating SeqPaths by blasting a BlastDB, and pickling
> these objects. Then, when I use them later for indexing into an NLMSA
> created with the same BlastDB, I get <type 'exceptions.KeyError'>:
> 'seq not in PrefixUnionDict'. I'm not even really sure how to track
> down the cause of this problem. As far as I can tell, the problem is
> with the getName() method of the NLMSA's seqDict and the path.db of
> the SeqPath object. If I re-create the SeqPath, then it works fine
> (seen in 56-58 below). Here's the events that lead me to this belief,
> not sure what the corrective action is though.
>
> In [52]: c.sequence
> Out[52]: chr7[2350:2366]
here you've already got the BlastDB open, and c.sequence.path is one
of its member sequences...
>
>
> In [53]: trna = pygr.Data.getResource("Bio.Annotation.YEAST.sacCer.SGD_features.tRNA")
here you open an NLMSA that presumably references one or more
BlastDBs. If these BlastDBs were stored with pygr.Data IDs, and
c.sequence.path.db above was also opened with a pygr.Data ID, then you
can indeed search trna with intervals of c.sequence.path.db. But
otherwise they will just be different BlastDB objects, and you'll get
a KeyError as shown in your next step:
>
> In [55]: trna.seqDict.getName(c.sequence)
> ---------------------------------------------------------------------------
> <type 'exceptions.KeyError'> Traceback (most recent call last)
>
> /home/baldig/projects/genomics/nonsvn/results/yeast/Ty3/BLAST/20090122/clustering/20090122/<ipython console> in <module>()
>
> /home/dock/shared_libraries/lx64/pythonpkgs/2.5.1/pygr_0_7_1/pygr/seqdb.py in getName(self, path)
> 1414 "return fully qualified ID i.e. 'foo.bar'"
> 1415 path=path.pathForward
> -> 1416 return self.dicts[path.db]+self.separator+path.id
> 1417
> 1418 def newMemberDict(self,**kwargs):
>
>
>
>
> In [56]: genome = b.pygr.Data.getResource("Bio.Seq.Genome.YEAST.sacCer")
here you are opening the BlastDB with its pygr.Data ID
>
>
> In [57]: s = genome[c.sequence.id][c.sequence.start:c.sequence.stop]
>
> In [58]: trna.seqDict.getName(s)
> Out[58]: 'sacCer.chr7'
this proves that c.sequence.path.db was not stored using the pygr.Data
ID. Otherwise, s and c.sequence would behave identically, because
s.path.db and c.sequence.path.db would be guaranteed to be the same
Python object.
It also shows that the NLMSA trna was stored with sequence databases
that have pygr.Data IDs. Otherwise this query wouldn't work. It's
the pygr.Data ID that unites disparate references to "the same dataset".
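
To make the KeyError concrete, here is a minimal sketch of the lookup pattern visible in the traceback line `self.dicts[path.db]+self.separator+path.id`. FakeDB, FakeSeq and PrefixLookup are hypothetical stand-ins, not pygr classes; I am only assuming that the lookup is a dictionary keyed on the database object itself, as that traceback line suggests.

```python
class FakeDB(object):
    """Hypothetical stand-in for a sequence database; not a pygr class."""

    def __init__(self, filename):
        self.filename = filename


class FakeSeq(object):
    """Hypothetical sequence that remembers which database it came from."""

    def __init__(self, db, seq_id):
        self.db = db
        self.id = seq_id


class PrefixLookup(object):
    """Hypothetical prefix lookup keyed on database objects (an assumption)."""

    def __init__(self, prefixes, separator="."):
        # Invert {prefix: db} into {db: prefix}; keys are the objects themselves.
        self.dicts = dict((db, prefix) for prefix, db in prefixes.items())
        self.separator = separator

    def get_name(self, seq):
        # Raises KeyError if seq.db is not the exact object stored as a key.
        return self.dicts[seq.db] + self.separator + seq.id


genome = FakeDB("sacCer.fa")    # the object the lookup was built with
reopened = FakeDB("sacCer.fa")  # same filename, but a different object

lookup = PrefixLookup({"sacCer": genome})
name = lookup.get_name(FakeSeq(genome, "chr7"))

try:
    lookup.get_name(FakeSeq(reopened, "chr7"))
    raised = False
except KeyError:
    raised = True
```

The sequence built from genome resolves to a fully qualified name, while the one built from reopened raises KeyError even though both databases name the same file. That is the same contrast as between In [58] and In [55] above.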
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"pygr-dev" group.
-~----------~----~----~----~------~----~------~--~---