Re: problems using sequence slices as index to NLMSA

Kenny Daily Wed, 28 Jan 2009 00:23:46 -0800

OK. These things make sense. However, I think what I'm doing is a
little more complicated, and I've left out some of the important steps
that may help explain. First, I'm sure that I'm using the pygr.Data
object every time, i.e. genome is always set by:


genome = pygr.Data.getResource("Bio.Seq.Genome.YEAST.sacCer")

However, I have an intermediate step that is likely what is causing
the problem. I take the blast results and make an NLMSA in memory for
temporary usage. Really the only reason I do this is because there are
attributes about the BLAST sequences that I want to store which can't
be added directly to sequence slices. Even here, when making the
intermediate AnnotationDB, I use the genome as above. After creating
this NLMSA, I loop through it to get all of the annotations back out
as a list with the attributes I've added intact, as well as their
"sequence" attribute as a by-product of making the NLMSA (I need them
sorted by the order I inserted them in). So, it's really these objects
that I'm pickling, and then using their sequence attribute to index
into a different NLMSA (trna, as mentioned before) that is stored on
disk (also built using the same genome). I figured this was easier than
making my own separate objects with a sequence attribute, as I have
abstracted the creation of AnnotationDBs and NLMSAs to read in
different data types (SGD, Refseq, Illumina, etc.). Could the
temporary NLMSA that I'm turning back into a list be causing any
issues? I will triple check that all creation of NLMSAs, both on disk
and in memory, are using the same genome from pygr.Data. Thanks again
for the explanation!
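
To make the question concrete, here is a minimal sketch in plain Python
(not pygr) of the kind of mismatch I'm worried about. ToyDB, ToySlice,
and get_name are hypothetical stand-ins, and the assumption is only that
the lookup dict is keyed by the database object itself, as the
traceback quoted below suggests. If any step hands back a different
database object for the same file, the lookup fails:

```python
class ToyDB:
    """Hypothetical stand-in for a sequence database; not a pygr class."""
    def __init__(self, filename):
        self.filename = filename
    # No __eq__ or __hash__ defined, so equality and hashing
    # fall back to object identity.


class ToySlice:
    """Hypothetical stand-in for a sequence slice that remembers its database."""
    def __init__(self, db, seq_id, start, stop):
        self.db, self.id, self.start, self.stop = db, seq_id, start, stop


def get_name(prefix_by_db, piece):
    """Look up the owning database's prefix and build a qualified name."""
    return prefix_by_db[piece.db] + "." + piece.id


genome = ToyDB("sacCer")
prefix_by_db = {genome: "sacCer"}  # dict keyed by the database object itself

# A slice built on the very same database object.
same_db = ToySlice(genome, "chr7", 2350, 2366)

# A slice built on a second object for the same file name.
other_db = ToySlice(ToyDB("sacCer"), "chr7", 2350, 2366)

good_name = get_name(prefix_by_db, same_db)

try:
    get_name(prefix_by_db, other_db)
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

The two slices describe the same interval, but only the one whose db
attribute is the original object can be found in the dict.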

On Jan 27, 10:12 pm, Christopher Lee <[email protected]> wrote:
> Hi Kenny,
> I think I can explain this.
> - it sounds like you are opening the same sequence database file  
> twice, creating two different BlastDB objects.  Pygr will not consider  
> these "equal" (either by cmp() or by dict hash index), so you get the  
> KeyError you reported.
>
> - This may seem counterintuitive, but it's a data integrity guarantee  
> -- unless they are the same object it is quite challenging to  
> guarantee that they are "the same data".  For example, for a BlastDB,  
> just comparing the filenames is not adequate; that can give both false  
> positive and false negative errors.  Just because the filename string  
> is identical doesn't mean the data are equal; data could have changed  
> on-disk in between the two file opening events.  On the other hand two  
> different filename strings might actually resolve to the same path.  
> For that matter, completely different filenames might point to two  
> files that are exact copies of each other.  To do the check right,  
> you'd have to recompute a hash on the entire dataset.
>
> - pygr's solution to this problem is pygr.Data, i.e. allocate a single  
> unique ID to a resource, and always obtain it by requesting this  
> unique ID.  pygr.Data guarantees that within an interpreter session  
> two different requests for the same ID will resolve to the same Python  
> object.
>
> - to take advantage of this, you just need to be consistent in always  
> identifying a given resource with its pygr.Data ID.  In your example,  
> I suspect you did something like the following: 1) you opened the file  
> by constructing a BlastDB, then used it to generate some blast  
> results, which you then pickled.  This works, but it has no pygr.Data  
> ID, so there's no way for pygr.Data to apply the above data integrity  
> guarantee to it.  2) you then opened the file again (either by  
> constructing a BlastDB again, or getting it from pygr.Data) and you  
> tried to look up results from one in the other (presumably through the  
> intermediary of an NLMSA).  This gave a KeyError.
>
> - this just reflects a fundamental problem: imagine all the analyses  
> and data being done all over the world as a huge "web" of cross-
> references.  In other words, many of those different analyses are  
> actually being done on the same dataset; you can think of that as a  
> "node" that ties together many different analyses being done all over  
> the world.  But ordinarily it would be very hard to run queries that  
> could traverse all these analyses, because the "metadata" that show  
> which data are actually the *same data* are not available.  So the  
> "web of data" fragments into single, disconnected analyses that a  
> computer program cannot query in a way that traverses more than one  
> analysis at a time.  pygr.Data tries to solve this basic problem, by  
> capturing the missing metadata of "identifiers" and "schema".
>
> - you can always check whether two database objects share the same  
> pygr.Data ID; just look at their _persistent_id attributes.  If the  
> attribute is missing, the object did not come from pygr.Data.
>
> Specific comments below.
>
> I hope this helps...
>
> -- Chris
>
> On Jan 28, 2009, at 3:04 AM, Kenny Daily wrote:
>
>
>
> > I am having trouble looking up sequence intervals in NLMSAs, using
> > pygr 0.7.1. I am creating SeqPaths by blasting a BlastDB, and pickling
> > these objects. Then, when I use them later for indexing into an NLMSA
> > created with the same BlastDB, I get <type 'exceptions.KeyError'>:
> > 'seq not in PrefixUnionDict'. I'm not even really sure how to track
> > down the cause of this problem. As far as I can tell, the problem is
> > with the getName() method of the NLMSA's seqDict and the path.db of
> > the SeqPath object. If I re-create the SeqPath, then it works fine
> > (seen in 56-58 below). Here are the events that led me to this belief;
> > I'm not sure what the corrective action is, though.
>
> > In [52]: c.sequence
> > Out[52]: chr7[2350:2366]
>
> here you've already got the BlastDB open, and c.sequence.path is one  
> of its member sequences...
>
>
>
> > In [53]: trna = pygr.Data.getResource("Bio.Annotation.YEAST.sacCer.SGD_features.tRNA")
>
> here you open an NLMSA that presumably references one or more  
> BlastDBs.  If these BlastDBs were stored with pygr.Data IDs, and
> c.sequence.path.db above was also opened with a pygr.Data ID, then you
> can indeed search trna with intervals of c.sequence.path.db.  But
> otherwise they will just be different BlastDB objects, and you'll get  
> a KeyError as shown in your next step:
>
>
>
>
>
> > In [55]: trna.seqDict.getName(c.sequence)
> > ---------------------------------------------------------------------------
> > <type 'exceptions.KeyError'>              Traceback (most recent call last)
>
> > /home/baldig/projects/genomics/nonsvn/results/yeast/Ty3/BLAST/20090122/clustering/20090122/<ipython console> in <module>()
>
> > /home/dock/shared_libraries/lx64/pythonpkgs/2.5.1/pygr_0_7_1/pygr/seqdb.py in getName(self, path)
> >   1414         "return fully qualified ID i.e. 'foo.bar'"
> >   1415         path=path.pathForward
> > -> 1416         return self.dicts[path.db]+self.separator+path.id
> >   1417
> >   1418     def newMemberDict(self,**kwargs):
>
> > In [56]: genome = b.pygr.Data.getResource("Bio.Seq.Genome.YEAST.sacCer")
>
> here you are opening the BlastDB with its pygr.Data ID
>
>
>
> > In [57]: s = genome[c.sequence.id][c.sequence.start:c.sequence.stop]
>
> > In [58]: trna.seqDict.getName(s)
> > Out[58]: 'sacCer.chr7'
>
> this proves that c.sequence.path.db was not stored using the pygr.Data  
> ID.  Otherwise, s and c.sequence would behave identically, because  
> s.path.db and c.sequence.path.db would be guaranteed to be the same  
> Python object.
>
> It also shows that the NLMSA trna was stored with sequence databases
> that have pygr.Data IDs.  Otherwise this query wouldn't work.  It's  
> the pygr.Data ID that unites disparate references to "the same dataset".
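
The idea of a single ID always resolving to the same object can be
sketched with the standard library's pickle hooks (persistent_id on the
pickler, persistent_load on the unpickler). This is an illustration
only and is not claimed to be how pygr.Data is implemented: ToyDB,
get_resource, and the _registry dict are hypothetical, and the
_persistent_id attribute name is borrowed from the message above.

```python
import io
import pickle


class ToyDB:
    """Hypothetical stand-in for a sequence database; not a pygr class."""
    def __init__(self, filename):
        self.filename = filename


_registry = {}  # resource ID -> the single live object for that ID


def get_resource(resource_id):
    """Return the same object for the same ID within this interpreter session."""
    if resource_id not in _registry:
        db = ToyDB(resource_id)          # a real loader would open files here
        db._persistent_id = resource_id  # remember which ID this object came from
        _registry[resource_id] = db
    return _registry[resource_id]


class IDPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Registered resources are stored as their ID only;
        # returning None tells pickle to serialize the object normally.
        return getattr(obj, "_persistent_id", None)


class IDUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # Resolve a stored ID back to the one shared object.
        return get_resource(pid)


genome = get_resource("Bio.Seq.Genome.YEAST.sacCer")
hit = {"db": genome, "id": "chr7", "start": 2350, "stop": 2366}

# Round-trip the hit through pickle using the ID-aware classes.
buf = io.BytesIO()
IDPickler(buf).dump(hit)
restored = IDUnpickler(io.BytesIO(buf.getvalue())).load()

# A dict keyed by the database object still finds the restored hit's db.
prefix_by_db = {genome: "sacCer"}
name = prefix_by_db[restored["db"]] + "." + restored["id"]
```

Because only the ID is written to the pickle, the restored hit points at
the same registered object, so a lookup keyed by that object succeeds.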