On Tuesday 14 November 2006 23:08, Erin Sheldon wrote:
> On 11/14/06, John Hunter <[EMAIL PROTECTED]> wrote:
> > Has anyone written any code to facilitate dumping mysql query results
> > (mainly arrays of floats) into numpy arrays directly at the extension
> > code layer.  The query results->list->array conversion can be slow.
> >
> > Ideally, one could do this semi-automagically with record arrays and
> > table introspection....
>
> I've been considering this as well.  I use both postgres and Oracle
> in my work, and I have been using the python interfaces (cx_Oracle
> and pgdb) to get result lists and convert to numpy arrays.
>
> The question I have been asking myself is "what is the advantage
> of such an approach?".  It would be faster, but by how
> much?  Presumably the bottleneck for most applications will
> be data retrieval rather than data copying in memory.

Well, that largely depends on how you access the data in your
database. If you are accessing regions of the database with a high
degree of spatial locality (i.e. repeatedly touching the same or
nearby rows), the data is most probably already in memory (in the
filesystem cache or in the database's own cache) and the bottleneck
becomes memory access. Of course, if your access pattern doesn't have
such spatial locality, then the bottleneck will be the disk.

Just to see how a DB API 2.0 implementation could benefit from
adopting record arrays as input buffers, I've done a comparison
between SQLite3 and PyTables. PyTables doesn't support DB API 2.0 as
such, but it does use record arrays internally as buffers so as to
read data efficiently (other databases may well feature this too, but
PyTables is the one I know best ;-).
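
For instance, once the file created by the benchmark below exists,
getting a record array out of PyTables is as direct as this (a minimal
sketch; "test.h5" and the "table" node are the ones created in the
listing at the end of the message):

import tables as pt

fileh = pt.openFile("test.h5", "r")
y = fileh.root.table[:]   # slicing a Table node returns a record array
fileh.close()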

For this, I've used a modified version of a small benchmarking program
posted by Tim Hochberg in this same thread (it is listed at the end
of the message). Here are the results:

setup SQLite took 23.5661110878 seconds
retrieve SQLite took 3.26717996597 seconds
setup PyTables took 0.139157056808 seconds
retrieve PyTables took 0.13444685936 seconds

[SQLite results were obtained using an in-memory database, while
PyTables used an on-disk one. See the code.]

So, yes, if your access pattern exhibits a high degree of locality,
you can expect a huge difference in read speed (more than 20x in this
example; and since the gap depends on the dataset size, it can be even
larger for bigger datasets).

> On the other hand, the database access modules for all major
> databases, with DB API 2.0 semi-compliance, have already been written.
> This is not an insignificant amount of work.  Writing our own
> interfaces for each of our favorite databases would require an
> equivalent amount of work.

That's true, but still feasible. However, before people start doing
this in a general way, it would help to first bring something like the
numpy.ndarray object into core Python: that would standardize a
full-fledged heterogeneous buffer for doing intensive I/O tasks.
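
To illustrate the idea, here is a minimal sketch of what a generic
layer on top of any DB API 2.0 cursor could look like today, at least
for drivers that support the (optional) cursor iteration extension;
cursor_to_recarray is just a hypothetical helper name:

import numpy as np

def cursor_to_recarray(cursor, dtype):
    # Each row comes out of the cursor as a tuple, and fromiter()
    # packs those tuples straight into a structured array.
    return np.fromiter(cursor, dtype=np.dtype(dtype))

# e.g., after cursor.execute('select x, y from demo'):
#   xy = cursor_to_recarray(cursor, [('x', float), ('y', float)])

A real implementation would do this at the extension code layer,
filling the array buffer directly as rows arrive from the driver, but
the data path would be the same.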

> I think a set of timing tests would be useful.  I will try some
> using Oracle or postgres over the next few days.  Perhaps
> you could do the same with mysql.

Well, here is my own (admittedly trivial) benchmark. I hope it helps
with your comparisons.

----------------------------------------------------------------------
import os
import time
import sqlite3
import numpy as np
import tables as pt

N = 500000

# Build a structured (record) array holding N random (x, y) rows.
rndata = np.random.rand(2, N)
dtype = np.dtype([('x', float), ('y', float)])
data = np.empty(shape=N, dtype=dtype)
data['x'] = rndata[0]
data['y'] = rndata[1]

def setupSQLite(conn):
    c = conn.cursor()
    c.execute('''create table demo (x real, y real)''')
    # executemany() iterates over the record array row by row.
    c.executemany("""insert into demo values (?, ?)""", data)

def retrieveSQLite(conn):
    c = conn.cursor()
    c.execute('select * from demo')
    # fromiter() packs the cursor's row tuples straight into a
    # record array, with no intermediate list of tuples.
    y = np.fromiter(c, dtype=dtype)
    return y

def setupPT(fileh):
    # PyTables derives the table schema directly from the record array.
    fileh.createTable('/', 'table', data)

def retrievePT(fileh):
    # Slicing a Table node reads it back as a numpy record array.
    y = fileh.root.table[:]
    return y


# Uncomment these lines (and comment out the ':memory:' line) to use
# an on-disk SQLite database instead:
# if os.path.exists('test.sql3'):
#     os.remove('test.sql3')
# conn = sqlite3.connect('test.sql3')
conn = sqlite3.connect(':memory:')

t0 = time.time()
setupSQLite(conn)
t1 = time.time()
print "setup SQLite took", t1-t0, "seconds"

t0 = time.time()
y1 = retrieveSQLite(conn)
t1 = time.time()
print "retrieve SQLite took", t1-t0, "seconds"

conn.close()

fileh = pt.openFile("test.h5", "w")

t0 = time.time()
setupPT(fileh)
t1 = time.time()
print "setup PyTables took", t1-t0, "seconds"

t0 = time.time()
y2 = retrievePT(fileh)
t1 = time.time()
print "retrieve PyTables took", t1-t0, "seconds"

fileh.close()

assert y1.shape == y2.shape
assert np.alltrue(y1 == y2)


-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"

