> On Dec 18, 2014, at 2:00 PM, SF Markus Elfring
> <[email protected]> wrote:
>
>> Can someone please take up this task and prove to me with benchmarks that
>> this effort will be worthwhile
>> (or at least show me what I’m going to have to put into my documentation as
>> to when these new APIs are appropriate) ?
>
> Some database software supports the processing of prepared SQL statements
> by well-known application programming interfaces for decades, doesn't it?
>
> Would you like to reuse any more experiences about execution speed and program
> safety from software development history?
Below is a benchmark script, psycopg2 vs. pg8000. The former uses no prepared
statements whatsoever; the latter uses prepared statements for all executions.
To make it even more “fair”, I will illustrate an executemany(), so that pg8000
can definitely take advantage of “caching” this statement at the
prepared-statement level.
The output times for INSERTing 50000 rows, using five batches of 10000 (so
that pg8000 can re-use the same prepared statement 10000 times), are:
Total time for psycopg2: 4.992107
Total time for pg8000: 10.770641
Now as it turns out, this test is entirely unfair. psycopg2 is written in C
and talks to libpq directly, whereas pg8000 is pure Python and implements the
PostgreSQL wire protocol itself. There is no chance of pg8000 ever approaching
the speed of psycopg2.
In my experience with execution speed in Python, enhancements like being able
to cache prepared statements are dwarfed by the key issues in Python: function
call overhead and code complexity. In this case, choosing a DBAPI because one
uses prepared statements and the other does not would not be the right
approach. It doesn’t matter that prepared statements are traditionally useful.
The complexity they would introduce, where programmers guess that using
prepared statements, along with the statement-caching architecture necessarily
surrounding their use, will make their program faster, will more often than
not have a net negative effect. Working with Openstack developers, I regularly
have to shoot down all kinds of changes that are made in the name of “will
(theoretically) perform better”, yet in reality, when benchmarked, they
perform worse, or hardly better at all, at the expense of greater complexity.
The theories are supported by all kinds of blanket statements such as
“iterators are faster”, “using Core statements is faster than the ORM”,
“asynchronous code is faster than synchronous”, which at face value are true
in some circumstances, but which in many specific situations are not at all
better from a performance standpoint in practice and only complicate the code.
Basically, I want there to be a very good reason for this feature to be an
explicit part of the DBAPI, because it is going to give me personally a lot of
extra headaches when people start asking for it in the name of “performance”,
especially since, in practice, psycopg2, pg8000 and perhaps mxODBC will be the
only DBAPIs to add this feature; the MySQL DBAPIs are unlikely to be
interested in it, and the pysqlite API almost certainly won’t be, as they have
been stalling for years just on very rudimentary transactional issues that
remain open. A DBAPI can already choose to make use of cached prepared
statements as an internal optimization, using an LRU cache that could be
configurable via the connect() function, or just via the executemany() API, as
some do right now; a rough sketch of that idea follows below.
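To make that concrete, here is a minimal sketch of what such a driver-internal
LRU cache of prepared statements might look like. This is purely illustrative:
the class name, the prepare callable, and the connect() parameter mentioned in
the comments are hypothetical and not part of any real DBAPI; the point is
only that the caching can live entirely inside the driver, with no change to
the pep-249 surface.

from collections import OrderedDict

# Hypothetical sketch: an LRU cache a driver could keep internally, keyed on
# SQL text and mapping to whatever the driver uses as a prepared-statement
# handle. A driver might size it via a (hypothetical)
# connect(prepared_statement_cache_size=...) argument.
class PreparedStatementCache(object):
    def __init__(self, prepare, maxsize=100):
        self._prepare = prepare      # driver-internal callable that prepares SQL
        self._maxsize = maxsize
        self._cache = OrderedDict()  # SQL text -> prepared statement handle

    def get(self, sql):
        try:
            # cache hit: move the entry to the "most recently used" end
            handle = self._cache.pop(sql)
        except KeyError:
            # cache miss: prepare the statement, evicting the least recently
            # used entry if the cache is full
            handle = self._prepare(sql)
            if len(self._cache) >= self._maxsize:
                self._cache.popitem(last=False)
        self._cache[sql] = handle
        return handle

A cursor.execute() implementation would consult get() before each execution;
executemany() benefits automatically, since the single SQL string hits the
cache on every row after the first.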
In my work with consumers of the DBAPI, prepared statement support is not high
(or even at all) on the list of needs. DBAPIs that can do unicode very well,
support asynchronous APIs such as asyncio as well as implicit environments such
as gevent, have lots of robust datatype support as well as very robust failure
mode handling, that’s what people are looking for (also areas that psycopg2
does extremely well), and these are all areas in which the actual pep-249 is
IMO sorely lacking in guidance and could use more immediate attention first.
import random
import timeit

import psycopg2
import pg8000


def setup(conn):
    # start with a fresh table for each driver
    cursor = conn.cursor()
    cursor.execute("drop table if exists data")
    cursor.execute(
        "create table data (id SERIAL primary key, data VARCHAR(100))")
    cursor.close()


def run_test(conn):
    # one batch: INSERT 10000 rows via a single executemany() call, so a
    # driver that prepares statements can reuse the same one throughout
    cursor = conn.cursor()
    cursor.executemany(
        "insert into data (data) values (%s)",
        [
            ("some value: %d" % random.randint(0, 10000),)
            for counter in xrange(10000)   # Python 2
        ]
    )
    cursor.close()


def do_time(dbapi):
    # run five batches against the given DBAPI and return the total time
    global conn
    conn = dbapi.connect(
        user='scott', password='tiger',
        database='test', host='localhost')
    time = timeit.timeit(
        "run_test(conn)",
        "from __main__ import run_test, setup, conn; setup(conn)",
        number=5
    )
    conn.close()
    return time


psycopg2_time = do_time(psycopg2)
pg8000_time = do_time(pg8000)

print("Total time for psycopg2: %f" % psycopg2_time)
print("Total time for pg8000: %f" % pg8000_time)