Hi,
Marek and I appear to have solved a serious performance problem in
MySQLdb that affects iteration over large numbers of rows:

- using a standard MySQLdb cursor, the initial
cursor.execute('SELECT ...') will actually load ALL the rows into
memory on the client side, consuming vast amounts of memory ... even
though you haven't yet called cursor.fetchone() / cursor.fetchmany()
to ask for any rows!  If you're interested, see:

http://jjinux.blogspot.com/2007/01/python-dealing-with-huge-data-sets-in.html
http://stackoverflow.com/questions/337479/how-to-get-a-row-by-row-mysql-resultset-in-python

- subsequently requesting rows via cursor.fetchone() /
cursor.fetchmany() then apparently transfers the row data from the
server a second time.

- using an SSCursor, the initial cursor.execute('SELECT ...') is fast,
and so is a subsequent cursor.fetchmany() of 1024 rows.

- BUT cursor.close(), or cursor.nextset(), or even just quitting the
Python interpreter then *hangs* in the most peculiar way, at least on
our test platform: it sits there allocating more and more RAM, just as
if it were loading ALL the rows into memory as above.  Who knows
what it's really doing; the main thing is that on Namshin's example it
allocates multiple GB of RAM and simply hangs forever.  As you can
see from our test script below (which imports no Pygr code), this is
strictly a MySQLdb issue.

These two problems would seem to make horrible performance unavoidable
for any operation that SELECTs more rows than will fit in memory, but
we've examined a couple of workarounds:

- issuing LIMIT OFFSET, NUMBER queries over and over again will not
solve the problem: the server apparently has to scan and discard all
OFFSET rows on every query, so the total work grows as O(N**2).

- however, we came up with an elegant workaround using a combination
of WHERE and LIMIT clauses (see the sketch right after this list).
For the implementation, see my sscursor branch; look at BlockGenerator
and MySQLServerInfo.iter_keys().  It runs Marek's test script
(annodbtest.py) in less than a second, allocating about 8 MB of RAM.
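
For anyone who doesn't want to dig through the branch, here is a
minimal sketch of the idea.  The function name, block size, and the use
of snp_id as the slice column are just for illustration; this is not
the actual BlockGenerator code.  It assumes the slice column is indexed
and unique, so each WHERE ... LIMIT query resumes at the index position
just past the previous block (cheap), instead of rescanning from the
start the way OFFSET does:

import MySQLdb

BLOCK_SIZE = 1024

def iter_col(conn, table='pygr_snp126_hg18', col='snp_id'):
    'yield every value of an indexed, unique column, one block at a time'
    c = conn.cursor()
    try:
        last = None
        while True:
            if last is None: # first block: start from the beginning
                c.execute('SELECT %s FROM %s ORDER BY %s LIMIT %%s'
                          % (col, table, col), (BLOCK_SIZE,))
            else: # later blocks: resume just past the last key we saw
                c.execute('SELECT %s FROM %s WHERE %s > %%s ORDER BY %s LIMIT %%s'
                          % (col, table, col, col), (last, BLOCK_SIZE))
            rows = c.fetchall() # at most BLOCK_SIZE rows in memory
            if not rows: # no rows left, so we're done
                break
            for row in rows:
                yield row[0]
            last = rows[-1][0]
    finally:
        c.close()

conn = MySQLdb.connect(user='foo', passwd='', db='PYGR_MEGATESTS')
for snp_id in iter_col(conn):
    pass # process each value; only one block is ever held in memory

Note that this uses a plain (client-side) cursor, so it avoids the
SSCursor hang entirely; each execute() only ever buffers one block.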

Marek, please test this yourself.

Everyone else, I'd be grateful for your feedback on this workaround.
I'm proposing it for inclusion in the 0.8 release, because without it
the current MySQLdb performance is really unacceptable.

A little more work is needed to generalize this for the case where the
user supplies an arbitrary orderBy clause (the solution: in that case
the user must also supply a whereClause to use for the slicing, along
with the list of column names that whereClause binds).
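
To make that concrete, here is a purely illustrative example of what
such user-supplied slicing might look like for a two-column orderBy;
the clause and column names are invented for this example and don't
reflect the final interface:

# hypothetical slicing clause for orderBy = 'ORDER BY chrom, start'
whereClause = '(chrom > %s) OR (chrom = %s AND start > %s)'
whereColumns = ['chrom', 'chrom', 'start'] # taken from the last row of each block
# each block query then becomes something like:
#   SELECT ... FROM mytable
#   WHERE (chrom > %s) OR (chrom = %s AND start > %s)
#   ORDER BY chrom, start LIMIT 1024
# with the %s parameters filled in from the last row fetched so far.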

Thanks!

-- Chris


Marek's basic_test.py script:
#!/usr/bin/python -i

import MySQLdb
from MySQLdb import cursors

# connect with a server-side cursor, so execute() streams rows
conn = MySQLdb.connect(user="foo", passwd="",
               db="PYGR_MEGATESTS",
               cursorclass=cursors.SSCursor)

c = conn.cursor()
c.execute('select snp_id from pygr_snp126_hg18') # returns quickly, as expected
rows = c.fetchmany(1024) # also fast
c.close() # this hangs forever, using up more & more RAM...




