Re: Stealing, err, "borrowing" from sqlalchemy

Christopher Lee Wed, 07 Jan 2009 11:44:14 -0800

Yesterday Jenny and I discussed a variety of issues about  
pygr.sqlgraph and sqlalchemy, which seem relevant to this thread...


-- Chris

Jenny Qian
1/6/09 12:28 PM
Hi Chris, I've been studying the sqlgraph module ...
Christopher Lee
1/6/09 12:28 PM
hi
Jenny Qian
1/6/09 12:29 PM
How do you like Istvan's proposal of making sa part of pygr's  
dependency?
Christopher Lee
1/6/09 12:30 PM
I'd have to analyze sa's capabilities very carefully.  I suspect it  
lacks a lot of what we need, but it would be great if it supplied  
everything we needed...
1/6/09 12:30 PM

performance is a major issue
Jenny Qian
1/6/09 12:32 PM
SQLALchemy seems to have more tedious interface than SQLTable.
Christopher Lee
1/6/09 12:34 PM
if you've already studied sa a bit, maybe we should together analyze  
whether it will fulfill pygr's needs.
Jenny Qian
1/6/09 12:35 PM
One thing about sqlgraph, is it true is can only access database info,  
but not update  database tables?
Christopher Lee
1/6/09 12:37 PM
sqlgraph has been used almost exclusively in read-only mode, but I  
have a branch where I added INSERT and UPDATE support.  We could merge  
that to the master branch.
Jenny Qian
1/6/09 12:39 PM
sqlgraph only supports mySQL?
Christopher Lee
1/6/09 12:43 PM
it uses generic SQL, so it should work with any database.  It's only  
been *tested* with mysql, so there will probably turn out to be a few  
places where we need to adjust the SQL to ensure compatibility.  The  
one hole in the python DB API 2.0 is that it provides no specification  
for retrieving schema information.  That means there just *isn't* any  
generic way to request schema information from a database.  This is  
very annoying.  We'd have to code different methods for getting schema  
information from different database platforms such as mysql, postgres,  
sqlite etc.  But thankfully this is a relatively small and isolated  
function, so that's probably not too hard.
Jenny Qian
1/6/09 12:44 PM
sorry, schema means the inter-relations between tables?
Christopher Lee
1/6/09 12:51 PM
For sqlgraph, at a minimum it needs to read all the column names for a  
table and automatically determine which is the primary key.  But  
looking forward to the future, yes, I'd like it to automatically infer  
the ER diagram (foreign key relations between tables).  Something like  
that could greatly facilitate connecting to a new database like  
Ensembl!  Pygr has some fairly simple code to do this, which I've used  
with my lab's databases, but I don't think anyone else has used that  
code...
Jenny Qian
1/6/09 12:53 PM
Can you show me the code?
1/6/09 12:56 PM

What I really liked about sqlgraph is the use of python's weakref  
module.  My understanding is that the combination of python's  
weakref.WeakValueDictionary and the SQL select based on the dbID could  
be powerful in generally improving the performance.  Is that right?
Christopher Lee
1/6/09 12:59 PM
Yes. We've had to address many performance / memory usage issues  
because of the very large datasets we work with.  Namshin Kim kept  
running into problems where the performance was inadequate either due  
to accessing data row by row or using up huge amounts of memory  
resulting in virtual memory swapping.  We solved these problems in  
several ways:
1/6/09 1:04 PM

- implicit control of how data is loaded: ordinarily, you don't want  
to load an entire table into local memory.  So standard iteration  
(i.e. __iter__) only gets a list of the primary keys; the actual data  
will only be loaded if the user requests the object corresponding to a  
given key.  And as you pointed out, as soon as a user drops a  
reference to an object, the WeakValueDictionary will automatically  
flush it from memory, keeping memory usage low.  By contrast, if the  
user actually wants all the data loaded in local memory, they can call  
the items() method, which will load the whole table with a single  
SELECT call, for maximum efficiency.  Thus, the user has a fairly  
intuitive way to choose row-by-row loading (the default) vs. efficient  
whole-table loading (using items()).
Jenny Qian
1/6/09 1:07 PM
I found that out by studying the sqlgraph as well.  This is very cool.
Christopher Lee
1/6/09 1:09 PM
- clusterKey: many kinds of data have some kind of grouping structure,  
for example, in a table storing exon information, we almost never need  
to look simultaneously at two exons from different genes.  I.e. exons  
form a gene structure, and we typically just want all the exons from  
one gene at a time.  So we can greatly speed up data retrieval by  
loading simultaneously all rows that share the same clusterKey value.   
If there are an average of 20 exons per gene, that would speed up  
loading of data from the server by a factor of 20...
Jenny Qian
1/6/09 1:11 PM
I also read a bit about the sa's way to handle eager/lazy loading, it  
is more demanding on user's knowledge/awareness about  *performance*.
1/6/09 1:12 PM

and it is less specific to the nature of biological data.
1/6/09 1:23 PM

gsc is trying to lay a sqlalchemy ORM between the solexa database and  
the zope website...
1/6/09 1:25 PM

I will read more about the eager/lazy loading of SA.  Just to see what  
they can offer...
1/6/09 1:26 PM

Currently, we don't have integration between SA and pygr.Data yet,  
right?
Christopher Lee
1/6/09 1:28 PM
no, no such integration
Jenny Qian
1/6/09 1:29 PM
Can sqlgraph objects be easily submitted to zope?
1/6/09 1:30 PM

How difficult is it to provide connectives between pygr's ORM and Zope?
Christopher Lee
1/6/09 1:34 PM

zope is python.  So presumably zope can be made to work with anything  
that has a python interface such as sqlalchemy or pygr...
Jenny Qian
1/6/09 1:50 PM

currently, the integration between sa and zope is already available...
Christopher Lee
1/6/09 1:54 PM
it would be useful to make a table comparing sqlalchemy and sqlgraph  
side by side...
Jenny Qian
1/6/09 1:56 PM
you are right.  I may try this
1/6/09 2:12 PM

Hi Chris, can you be more specific about what you need, so when I am  
studying SA and drawing the comparison tables, I could be more  
specific as well?
Christopher Lee
1/6/09 2:13 PM
performance features are an important area; all the performance issues  
that we discussed above.
Jenny Qian
1/6/09 2:14 PM
ic
Christopher Lee
1/6/09 2:14 PM
interface issues: sqlgraph follows a strict dict-like interface, with  
the primary key as dictionary key.  What does sqlalchemy do?
Jenny Qian
1/6/09 2:15 PM
ok
Christopher Lee
1/6/09 2:19 PM
class issues: sqlgraph uses itemClass and itemSliceClass parameters to  
let the user control what class will be used to construct row objects,  
or slices of row objects (a concept which probably doesn't exist in  
sqlalchemy).  sqlgraph uses "subclass binding" to customize a generic  
row class to efficiently access object attributes.  What does  
sqlalchemy do?
1/6/09 2:20 PM

pickling issues: can sqlalchemy table or row objects be pickled and  
unpickled?  Will they work in pygr.Data? (i.e. automatically reconnect  
to the correct server, as sqlgraph does?)
Jenny Qian
1/6/09 2:22 PM
*pickling* is for sure an advantage of sqlgraph objects.
1/6/09 2:25 PM

In other words, they don't provide a clean interface as pygr.Data  
does.  User has to re-run the script to require a database connection  
explicitly.
Christopher Lee
1/6/09 2:26 PM
memory usage issues: sqlgraph uses a variety of tricks to minimize  
memory usage.  For example, the default row object stores each row as  
a Python tuple, eliminating the need to store a __dict__ for each row  
object.  It also provides another mechanism (SQLRow) that only loads  
attribute data when users ask for it; this is used for SQLSequence,  
which provides virtual sequence access (automatic slicing) without  
having to load a whole sequence string into memory.  That is crucial  
for applications that store an entire genome, chromosome or other  
large sequence as a database row...
Jenny Qian
1/6/09 2:27 PM
wow.
Christopher Lee
1/6/09 2:28 PM
we can work together on this comparison, once you put together an  
initial list of points...
Jenny Qian
1/6/09 2:28 PM
But sa does seem to have a few advantages as well like simplifying  
sophisticated joins and subqueries.
Christopher Lee
1/6/09 2:28 PM
that would be great to learn about
Jenny Qian
1/6/09 2:28 PM
sounds good.
1/6/09 2:30 PM

Is strict dict-like interface really an advantage?
Christopher Lee
1/6/09 2:32 PM
i suspect that pygr.Data compatibility is the most important question  
we'll have to look at.  An object-relational interface aims to provide  
an object interface to persistent data.  The purpose of pygr.Data is  
to make database persistence totally transparent, at both the schema  
and code level.  A traditional ORM like sqlalchemy only goes half way  
towards this goal of transparent persistence; you can't just say "I  
want the data named Bio.Genome.Foo" and start working...
Jenny Qian
1/6/09 2:33 PM
exactly, I totally agree.
Christopher Lee
1/6/09 2:34 PM
Yes, I think a dict-like interface is a natural Python model for any  
dataset with a primary key.  Python programmers know how to use dict  
interfaces; why should they have to learn something more complicated  
to do the same thing?
Jenny Qian
1/6/09 2:35 PM
but then you don't always know the dbID when you want to retrieve a  
row... Often you don't.
1/6/09 2:36 PM

instead you know the common name for a gene for instance.
Christopher Lee
1/6/09 2:36 PM
In that case SELECT seems like the right interface; again, it's  
something everybody knows...
Jenny Qian
1/6/09 2:36 PM
so the where clause will be more useful in practice.
1/6/09 2:36 PM

ic
Christopher Lee
1/6/09 2:37 PM
sqltable has a select method...
Jenny Qian
1/6/09 2:37 PM
still pygr is not supporting 'order by'
1/6/09 2:37 PM

I know.
1/6/09 2:37 PM

the select method in the SQLTable
1/6/09 2:38 PM

We use 'Order by' so often...
Christopher Lee
1/6/09 2:39 PM
actually orderBy is a supported parameter for SQLTable and its  
subclasses...
Jenny Qian
1/6/09 2:39 PM
hmm...
Christopher Lee
1/6/09 2:39 PM
added recently, don't remember when
Jenny Qian
1/6/09 2:40 PM
hehe
Christopher Lee
1/6/09 2:40 PM
needs to be tested systematically
Jenny Qian
1/6/09 2:41 PM
why do we need to separate SQLTable and SQLTableBase?
1/6/09 2:42 PM

Are all the tests related to sqlgraph.py in the sqltable_test.py and  
graph_test.py?
Christopher Lee
1/6/09 2:45 PM
SQLTableBase is the base class for many flavors of access: SQLTable  
(data get loaded as row objects); SQLTableClustered (uses clusterKey  
as I described above); SQLTableNoCache (row objects do not load the  
data locally, but access it from the database on the fly),  
SQLTableMultiNoCache (provides a one-to-many interface suitable for a  
foreign key), and a number of others.
Jenny Qian
1/6/09 2:46 PM
right, now I remember again ^.^
Christopher Lee
1/6/09 2:46 PM
sqlsequence_test.py as well
Jenny Qian
1/6/09 2:46 PM
ok

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"pygr-dev" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/pygr-dev?hl=en
-~----------~----~----~----~------~----~------~--~---

Re: Stealing, err, "borrowing" from sqlalchemy

Reply via email to