Re: problems using sequence slices as index to NLMSA

Christopher Lee Tue, 27 Jan 2009 22:13:14 -0800

Hi Kenny,
I think I can explain this.
- it sounds like you are opening the same sequence database file  
twice, creating two different BlastDB objects.  Pygr will not consider  
these "equal" (either by cmp() or by dict hash index), so you get the  
KeyError you reported.


- This may seem counterintuitive, but it's a data integrity guarantee  
-- unless they are the same object it is quite challenging to  
guarantee that they are "the same data".  For example, for a BlastDB,  
just comparing the filenames is not adequate; that can give both false  
positive and false negative errors.  Just because the filename string  
is identical doesn't mean the data are equal; data could have changed  
on-disk in between the two file opening events.  On the other hand two  
different filename strings might actually resolve to the same path.   
For that matter, completely different filenames might point to two  
files that are exact copies of each other.  To do the check right,  
you'd have to recompute a hash on the entire dataset.
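
As a minimal sketch of why the lookup fails (plain Python, not pygr
code; SeqDB and the filename are hypothetical stand-ins for a BlastDB
and its file): an object with default equality and hashing only
matches itself as a dict key, even when a second object was built
from the identical filename.

```python
class SeqDB(object):
    """Hypothetical stand-in for a sequence database (not pygr's BlastDB)."""
    def __init__(self, filepath):
        self.filepath = filepath

# Open "the same file" twice: two distinct Python objects.
db1 = SeqDB("sacCer.fa")
db2 = SeqDB("sacCer.fa")

# A dict keyed by database object, loosely like a prefix dictionary.
index = {db1: "sacCer"}

equal = (db1 == db2)        # default equality is object identity
try:
    index[db2]              # db2 was never registered as a key
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

Here equal is False and the second lookup raises KeyError, even though
both objects carry the same filename string.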

- pygr's solution to this problem is pygr.Data, i.e. allocate a single  
unique ID to a resource, and always obtain it by requesting this  
unique ID.  pygr.Data guarantees that within an interpreter session  
two different requests for the same ID will resolve to the same Python  
object.

- to take advantage of this, you just need to be consistent in always
identifying a given resource with its pygr.Data ID.  In your example,
I suspect you did something like the following: 1) you opened the file
by constructing a BlastDB directly, then used it to generate some
blast results, which you then pickled.  This works, but that BlastDB
has no pygr.Data ID, so there's no way for pygr.Data to apply the
above data integrity guarantee to it.  2) you then opened the file
again (either by constructing a BlastDB again, or getting it from
pygr.Data) and you tried to look up results from one in the other
(presumably through the intermediary of an NLMSA).  This gave a
KeyError.
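
To illustrate the difference an ID makes when results are pickled,
here is a sketch using the persistent_id / persistent_load hooks of
Python's standard pickle module.  It is an analogy only; it does not
claim to reproduce pygr's internals.  SeqDB, registry, IDPickler and
IDUnpickler are hypothetical names.

```python
import io
import pickle

class SeqDB(object):
    """Hypothetical stand-in for a sequence database (not pygr's BlastDB)."""
    def __init__(self, name):
        self.name = name

# Hypothetical registry: the one live object for each resource ID.
genome = SeqDB("sacCer")
genome._persistent_id = "Bio.Seq.Genome.YEAST.sacCer"
registry = {genome._persistent_id: genome}

class IDPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Objects carrying an ID are saved as that ID, not by value.
        return getattr(obj, "_persistent_id", None)

class IDUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # Resolve the saved ID back to the registered object.
        return registry[pid]

# A saved "result": an interval on the genome, stored as a plain dict.
hit = {"db": genome, "id": "chr7", "start": 2350, "stop": 2366}

buf = io.BytesIO()
IDPickler(buf).dump(hit)
buf.seek(0)
restored = IDUnpickler(buf).load()

# The restored result refers to the very same database object, so it
# works as a key wherever genome is the key.
seq_dict = {genome: "sacCer"}
prefix = seq_dict[restored["db"]]
```

Without an ID, the database would be pickled by value and come back
as a new, unrelated object, which is the situation that leads to the
KeyError.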

- this just reflects a fundamental problem: imagine all the analyses
being run on data all over the world as a huge "web" of
cross-references.  Many of those analyses are actually being done on
the same dataset; you can think of that dataset as a "node" that ties
them together.  But ordinarily it would be very hard to run queries
that traverse all these analyses, because the "metadata" that show
which data are actually the *same data* are not available.  So the
"web of data" fragments into single, disconnected analyses, and a
computer program cannot run a query that spans more than one analysis
at a time.  pygr.Data tries to solve this basic problem by capturing
the missing metadata of "identifiers" and "schema".

- you can always check whether two database objects share the same  
pygr.Data ID; just look at their _persistent_id attributes.  If the  
attribute is missing, the object did not come from pygr.Data.
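
A small helper for that check might look like the sketch below.  Only
the _persistent_id attribute name comes from pygr; same_resource and
FakeDB are hypothetical names used for illustration.

```python
def same_resource(a, b):
    """True only if both objects carry the same _persistent_id."""
    id_a = getattr(a, "_persistent_id", None)
    id_b = getattr(b, "_persistent_id", None)
    # A missing attribute means the object did not come from pygr.Data,
    # so it can never be confirmed as "the same resource".
    return id_a is not None and id_a == id_b

class FakeDB(object):
    """Hypothetical stand-in object, used only to exercise the helper."""
    def __init__(self, persistent_id=None):
        if persistent_id is not None:
            self._persistent_id = persistent_id

from_pygr_data = FakeDB("Bio.Seq.Genome.YEAST.sacCer")
also_from_pygr_data = FakeDB("Bio.Seq.Genome.YEAST.sacCer")
opened_by_hand = FakeDB()

shared = same_resource(from_pygr_data, also_from_pygr_data)
mismatch = same_resource(from_pygr_data, opened_by_hand)
```

In your session you would pass c.sequence.path.db and the database
you got back from pygr.Data.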

Specific comments below.

I hope this helps...

-- Chris


On Jan 28, 2009, at 3:04 AM, Kenny Daily wrote:

>
> I am having trouble looking up sequence intervals in NLMSAs, using
> pygr 0.7.1. I am creating SeqPaths by blasting a BlastDB, and pickling
> these objects. Then, when I use them later for indexing into an NLMSA
> created with the same BlastDB, I get <type 'exceptions.KeyError'>:
> 'seq not in PrefixUnionDict'. I'm not even really sure how to track
> down the cause of this problem. As far as I can tell, the problem is
> with the getName() method of the NLMSA's seqDict and the path.db of
> the SeqPath object. If I re-create the SeqPath, then it works fine
> (seen in 56-58 below). Here's the events that lead me to this belief,
> not sure what the corrective action is though.
>
> In [52]: c.sequence
> Out[52]: chr7[2350:2366]
here you've already got the BlastDB open, and c.sequence.path is one  
of its member sequences...


>
>
> In [53]: trna = pygr.Data.getResource("Bio.Annotation.YEAST.sacCer.SGD_features.tRNA")
here you open an NLMSA that presumably references one or more
BlastDBs.  If these BlastDBs were stored with pygr.Data IDs, and
c.sequence.path.db above was also opened with a pygr.Data ID, then you
can indeed search trna with intervals of c.sequence.path.db.  But
otherwise they will just be different BlastDB objects, and you'll get
a KeyError as shown in your next step:

>
> In [55]: trna.seqDict.getName(c.sequence)
> ---------------------------------------------------------------------------
> <type 'exceptions.KeyError'>              Traceback (most recent call last)
>
> /home/baldig/projects/genomics/nonsvn/results/yeast/Ty3/BLAST/20090122/clustering/20090122/<ipython console> in <module>()
>
> /home/dock/shared_libraries/lx64/pythonpkgs/2.5.1/pygr_0_7_1/pygr/seqdb.py in getName(self, path)
>   1414         "return fully qualified ID i.e. 'foo.bar'"
>   1415         path=path.pathForward
> -> 1416         return self.dicts[path.db]+self.separator+path.id
>   1417
>   1418     def newMemberDict(self,**kwargs):
>
>
>
>
> In [56]: genome = b.pygr.Data.getResource("Bio.Seq.Genome.YEAST.sacCer")
here you are opening the BlastDB with its pygr.Data ID

>
>
> In [57]: s = genome[c.sequence.id][c.sequence.start:c.sequence.stop]
>
> In [58]: trna.seqDict.getName(s)
> Out[58]: 'sacCer.chr7'
this proves that c.sequence.path.db was not stored using the pygr.Data  
ID.  Otherwise, s and c.sequence would behave identically, because  
s.path.db and c.sequence.path.db would be guaranteed to be the same  
Python object.

It also shows that the NLMSA trna was stored with sequence databases
that have pygr.Data IDs.  Otherwise this query wouldn't work.  It's
the pygr.Data ID that unites disparate references to "the same dataset".



