Yesterday Jenny and I discussed a variety of issues about pygr.sqlgraph and sqlalchemy, which seem relevant to this thread...
-- Chris Jenny Qian 1/6/09 12:28 PM Hi Chris, I've been studying the sqlgraph module ... Christopher Lee 1/6/09 12:28 PM hi Jenny Qian 1/6/09 12:29 PM How do you like Istvan's proposal of making sa part of pygr's dependency? Christopher Lee 1/6/09 12:30 PM I'd have to analyze sa's capabilities very carefully. I suspect it lacks a lot of what we need, but it would be great if it supplied everything we needed... 1/6/09 12:30 PM performance is a major issue Jenny Qian 1/6/09 12:32 PM SQLALchemy seems to have more tedious interface than SQLTable. Christopher Lee 1/6/09 12:34 PM if you've already studied sa a bit, maybe we should together analyze whether it will fulfill pygr's needs. Jenny Qian 1/6/09 12:35 PM One thing about sqlgraph, is it true is can only access database info, but not update database tables? Christopher Lee 1/6/09 12:37 PM sqlgraph has been used almost exclusively in read-only mode, but I have a branch where I added INSERT and UPDATE support. We could merge that to the master branch. Jenny Qian 1/6/09 12:39 PM sqlgraph only supports mySQL? Christopher Lee 1/6/09 12:43 PM it uses generic SQL, so it should work with any database. It's only been *tested* with mysql, so there will probably turn out to be a few places where we need to adjust the SQL to ensure compatibility. The one hole in the python DB API 2.0 is that it provides no specification for retrieving schema information. That means there just *isn't* any generic way to request schema information from a database. This is very annoying. We'd have to code different methods for getting schema information from different database platforms such as mysql, postgres, sqlite etc. But thankfully this is a relatively small and isolated function, so that's probably not too hard. Jenny Qian 1/6/09 12:44 PM sorry, schema means the inter-relations between tables? Christopher Lee 1/6/09 12:51 PM For sqlgraph, at a minimum it needs to read all the column names for a table and automatically determine which is the primary key. But looking forward to the future, yes, I'd like it to automatically infer the ER diagram (foreign key relations between tables). Something like that could greatly facilitate connecting to a new database like Ensembl! Pygr has some fairly simple code to do this, which I've used with my lab's databases, but I don't think anyone else has used that code... Jenny Qian 1/6/09 12:53 PM Can you show me the code? 1/6/09 12:56 PM What I really liked about sqlgraph is the use of python's weakref module. My understanding is that the combination of python's weakref.WeakValueDictionary and the SQL select based on the dbID could be powerful in generally improving the performance. Is that right? Christopher Lee 1/6/09 12:59 PM Yes. We've had to address many performance / memory usage issues because of the very large datasets we work with. Namshin Kim kept running into problems where the performance was inadequate either due to accessing data row by row or using up huge amounts of memory resulting in virtual memory swapping. We solved these problems in several ways: 1/6/09 1:04 PM - implicit control of how data is loaded: ordinarily, you don't want to load an entire table into local memory. So standard iteration (i.e. __iter__) only gets a list of the primary keys; the actual data will only be loaded if the user requests the object corresponding to a given key. And as you pointed out, as soon as a user drops a reference to an object, the WeakValueDictionary will automatically flush it from memory, keeping memory usage low. By contrast, if the user actually wants all the data loaded in local memory, they can call the items() method, which will load the whole table with a single SELECT call, for maximum efficiency. Thus, the user has a fairly intuitive way to choose row-by-row loading (the default) vs. efficient whole-table loading (using items()). Jenny Qian 1/6/09 1:07 PM I found that out by studying the sqlgraph as well. This is very cool. Christopher Lee 1/6/09 1:09 PM - clusterKey: many kinds of data have some kind of grouping structure, for example, in a table storing exon information, we almost never need to look simultaneously at two exons from different genes. I.e. exons form a gene structure, and we typically just want all the exons from one gene at a time. So we can greatly speed up data retrieval by loading simultaneously all rows that share the same clusterKey value. If there are an average of 20 exons per gene, that would speed up loading of data from the server by a factor of 20... Jenny Qian 1/6/09 1:11 PM I also read a bit about the sa's way to handle eager/lazy loading, it is more demanding on user's knowledge/awareness about *performance*. 1/6/09 1:12 PM and it is less specific to the nature of biological data. 1/6/09 1:23 PM gsc is trying to lay a sqlalchemy ORM between the solexa database and the zope website... 1/6/09 1:25 PM I will read more about the eager/lazy loading of SA. Just to see what they can offer... 1/6/09 1:26 PM Currently, we don't have integration between SA and pygr.Data yet, right? Christopher Lee 1/6/09 1:28 PM no, no such integration Jenny Qian 1/6/09 1:29 PM Can sqlgraph objects be easily submitted to zope? 1/6/09 1:30 PM How difficult is it to provide connectives between pygr's ORM and Zope? Christopher Lee 1/6/09 1:34 PM zope is python. So presumably zope can be made to work with anything that has a python interface such as sqlalchemy or pygr... Jenny Qian 1/6/09 1:50 PM currently, the integration between sa and zope is already available... Christopher Lee 1/6/09 1:54 PM it would be useful to make a table comparing sqlalchemy and sqlgraph side by side... Jenny Qian 1/6/09 1:56 PM you are right. I may try this 1/6/09 2:12 PM Hi Chris, can you be more specific about what you need, so when I am studying SA and drawing the comparison tables, I could be more specific as well? Christopher Lee 1/6/09 2:13 PM performance features are an important area; all the performance issues that we discussed above. Jenny Qian 1/6/09 2:14 PM ic Christopher Lee 1/6/09 2:14 PM interface issues: sqlgraph follows a strict dict-like interface, with the primary key as dictionary key. What does sqlalchemy do? Jenny Qian 1/6/09 2:15 PM ok Christopher Lee 1/6/09 2:19 PM class issues: sqlgraph uses itemClass and itemSliceClass parameters to let the user control what class will be used to construct row objects, or slices of row objects (a concept which probably doesn't exist in sqlalchemy). sqlgraph uses "subclass binding" to customize a generic row class to efficiently access object attributes. What does sqlalchemy do? 1/6/09 2:20 PM pickling issues: can sqlalchemy table or row objects be pickled and unpickled? Will they work in pygr.Data? (i.e. automatically reconnect to the correct server, as sqlgraph does?) Jenny Qian 1/6/09 2:22 PM *pickling* is for sure an advantage of sqlgraph objects. 1/6/09 2:25 PM In other words, they don't provide a clean interface as pygr.Data does. User has to re-run the script to require a database connection explicitly. Christopher Lee 1/6/09 2:26 PM memory usage issues: sqlgraph uses a variety of tricks to minimize memory usage. For example, the default row object stores each row as a Python tuple, eliminating the need to store a __dict__ for each row object. It also provides another mechanism (SQLRow) that only loads attribute data when users ask for it; this is used for SQLSequence, which provides virtual sequence access (automatic slicing) without having to load a whole sequence string into memory. That is crucial for applications that store an entire genome, chromosome or other large sequence as a database row... Jenny Qian 1/6/09 2:27 PM wow. Christopher Lee 1/6/09 2:28 PM we can work together on this comparison, once you put together an initial list of points... Jenny Qian 1/6/09 2:28 PM But sa does seem to have a few advantages as well like simplifying sophisticated joins and subqueries. Christopher Lee 1/6/09 2:28 PM that would be great to learn about Jenny Qian 1/6/09 2:28 PM sounds good. 1/6/09 2:30 PM Is strict dict-like interface really an advantage? Christopher Lee 1/6/09 2:32 PM i suspect that pygr.Data compatibility is the most important question we'll have to look at. An object-relational interface aims to provide an object interface to persistent data. The purpose of pygr.Data is to make database persistence totally transparent, at both the schema and code level. A traditional ORM like sqlalchemy only goes half way towards this goal of transparent persistence; you can't just say "I want the data named Bio.Genome.Foo" and start working... Jenny Qian 1/6/09 2:33 PM exactly, I totally agree. Christopher Lee 1/6/09 2:34 PM Yes, I think a dict-like interface is a natural Python model for any dataset with a primary key. Python programmers know how to use dict interfaces; why should they have to learn something more complicated to do the same thing? Jenny Qian 1/6/09 2:35 PM but then you don't always know the dbID when you want to retrieve a row... Often you don't. 1/6/09 2:36 PM instead you know the common name for a gene for instance. Christopher Lee 1/6/09 2:36 PM In that case SELECT seems like the right interface; again, it's something everybody knows... Jenny Qian 1/6/09 2:36 PM so the where clause will be more useful in practice. 1/6/09 2:36 PM ic Christopher Lee 1/6/09 2:37 PM sqltable has a select method... Jenny Qian 1/6/09 2:37 PM still pygr is not supporting 'order by' 1/6/09 2:37 PM I know. 1/6/09 2:37 PM the select method in the SQLTable 1/6/09 2:38 PM We use 'Order by' so often... Christopher Lee 1/6/09 2:39 PM actually orderBy is a supported parameter for SQLTable and its subclasses... Jenny Qian 1/6/09 2:39 PM hmm... Christopher Lee 1/6/09 2:39 PM added recently, don't remember when Jenny Qian 1/6/09 2:40 PM hehe Christopher Lee 1/6/09 2:40 PM needs to be tested systematically Jenny Qian 1/6/09 2:41 PM why do we need to separate SQLTable and SQLTableBase? 1/6/09 2:42 PM Are all the tests related to sqlgraph.py in the sqltable_test.py and graph_test.py? Christopher Lee 1/6/09 2:45 PM SQLTableBase is the base class for many flavors of access: SQLTable (data get loaded as row objects); SQLTableClustered (uses clusterKey as I described above); SQLTableNoCache (row objects do not load the data locally, but access it from the database on the fly), SQLTableMultiNoCache (provides a one-to-many interface suitable for a foreign key), and a number of others. Jenny Qian 1/6/09 2:46 PM right, now I remember again ^.^ Christopher Lee 1/6/09 2:46 PM sqlsequence_test.py as well Jenny Qian 1/6/09 2:46 PM ok --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "pygr-dev" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/pygr-dev?hl=en -~----------~----~----~----~------~----~------~--~---
