Hi,
Marek and I appear to have tracked down, and worked around, a serious
performance problem in MySQLdb that affects iteration over large
numbers of rows:
- using a standard MySQLdb cursor, the initial
cursor.execute('SELECT ...') will actually load ALL the rows into
memory on the client side, consuming vast amounts of memory, even
though you haven't yet called cursor.fetchone()/fetchmany() to ask for
any rows! (A sketch demonstrating this follows the list.) For
background, see:
http://jjinux.blogspot.com/2007/01/python-dealing-with-huge-data-sets-in.html
http://stackoverflow.com/questions/337479/how-to-get-a-row-by-row-mysql-resultset-in-python
- actually requesting rows via cursor.fetchone()/fetchmany() then
apparently transfers the row data from the server a second time.
- using an SSCursor, the initial cursor.execute('SELECT ...') is fast,
and so is a subsequent cursor.fetchmany() of 1024 rows.
- BUT cursor.close(), or cursor.nextset(), or even just quitting the
Python interpreter then *hangs* in the most peculiar way, at least on
our test platform: it sits there allocating more and more RAM, just as
if it were loading ALL the rows into memory like above. Who knows what
it's really doing; our best guess is that MySQLdb must drain the rest
of the unbuffered result set before the connection can be reused,
since the MySQL protocol requires a result to be read to completion.
The main thing is that in Namshin's example it allocates multiple GB
of RAM and simply hangs forever. As you can see from our test script
below (which imports no Pygr code), this is strictly a MySQLdb issue.
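To make the first problem concrete, here's a minimal sketch (same
hypothetical table and credentials as in Marek's script below; the
peak-RSS helper is just for illustration) showing that the memory is
consumed by execute() itself, before any fetch call:

#!/usr/bin/python
# Sketch of problem 1: with the default (buffered) cursor, execute()
# itself pulls the entire result set into client memory
# (mysql_store_result), before any rows have been fetched.
import resource  # Unix only; used here just to report peak memory
import MySQLdb

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

conn = MySQLdb.connect(user="foo", passwd="", db="PYGR_MEGATESTS")
c = conn.cursor()  # default cursor class, i.e. buffered
print 'peak RSS before execute: %.1f MB' % peak_rss_mb()
c.execute('select snp_id from pygr_snp126_hg18')  # ALL rows arrive here
print 'peak RSS after execute: %.1f MB' % peak_rss_mb()
c.close()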
These two problems would seem to make horrible performance unavoidable
for any operation that SELECTs more rows than will fit in memory, but
we've examined a couple of workarounds:
- using LIMIT offset, count over and over again will not solve the
problem: the server must scan past all of the skipped rows on every
query, so each successive block costs more and the total work grows
as O(N**2).
- however, we came up with an elegant workaround using a combination
of WHERE and LIMIT clauses (a sketch of the idea appears after this
list). See my sscursor branch for the implementation; look at
BlockGenerator and MySQLServerInfo.iter_keys(). It runs Marek's test
script (annodbtest.py) in less than a second, allocating about 8 MB
of RAM. Marek, please test this yourself.
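In case it helps the discussion, here is a minimal sketch of the
WHERE + LIMIT idea as I understand it (this is NOT the actual
BlockGenerator / iter_keys() code; table, column and block size are
the hypothetical ones from Marek's script): each query resumes just
past the last key seen, so the index seeks straight to the next block
and no rows are ever skipped or buffered beyond one block.

import MySQLdb

def iter_block_keys(conn, table='pygr_snp126_hg18', col='snp_id',
                    blockSize=1024):
    'yield every key in the table, holding at most one block in memory'
    # assumes col is unique (e.g. the primary key), so "> lastKey"
    # never loses a row
    c = conn.cursor()  # the default buffered cursor is fine: each
                       # result set is at most blockSize rows
    lastKey = None
    while True:
        if lastKey is None:  # first block
            c.execute('SELECT %s FROM %s ORDER BY %s LIMIT %%s'
                      % (col, table, col), (blockSize,))
        else:  # resume just past the last key we saw, via the index
            c.execute('SELECT %s FROM %s WHERE %s > %%s ORDER BY %s LIMIT %%s'
                      % (col, table, col, col), (lastKey, blockSize))
        rows = c.fetchall()
        if not rows:
            break
        for row in rows:
            yield row[0]
        lastKey = rows[-1][0]
    c.close()

conn = MySQLdb.connect(user="foo", passwd="", db="PYGR_MEGATESTS")
for key in iter_block_keys(conn):
    pass  # process each key; memory stays bounded by one block

Unlike OFFSET, the WHERE clause lets the index do the seeking, so each
block costs roughly the same no matter how far into the table we are;
the one requirement is an indexed key column to order by, which is
exactly what the orderBy generalization below is about.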
Everyone else, I'd be grateful for your feedback on this workaround.
I'm proposing it for inclusion in the 0.8 release because the current
MySQLdb performance is really unacceptable without it.
A little more work is needed to generalize this for the case where the
user supplies an arbitrary orderBy clause; in that case the user must
also supply a whereClause to use for this slicing, along with the list
of columns that whereClause refers to.
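Purely to illustrate what I mean (this is not a settled API, and the
column names are made up): if the user supplies
orderBy = 'ORDER BY chrom, pos', the slicing predicate has to cover
those same columns, which MySQL's row-constructor comparison makes
straightforward:

# Hypothetical illustration only; not the actual Pygr API.
whereClause = '(chrom, pos) > (%s, %s)'  # row-constructor comparison
whereColumns = ('chrom', 'pos')  # tells us which values to resume from

def next_block(c, lastValues, blockSize=1024):
    'fetch the block following the last (chrom, pos) pair already seen'
    c.execute('SELECT snp_id FROM pygr_snp126_hg18 WHERE ' + whereClause
              + ' ORDER BY chrom, pos LIMIT %s',
              tuple(lastValues) + (blockSize,))
    return c.fetchall()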
Thanks!
-- Chris
Marek's basic_test.py script:
#!/usr/bin/python -i
import MySQLdb
from MySQLdb import cursors

conn = MySQLdb.connect(user="foo", passwd="",
                       db="PYGR_MEGATESTS",
                       cursorclass=cursors.SSCursor)
c = conn.cursor()
c.execute('select snp_id from pygr_snp126_hg18')
rows = c.fetchmany(1024)
c.close()  # this hangs forever, using up more & more RAM...