> On Dec 18, 2014, at 2:00 PM, SF Markus Elfring 
> <elfr...@users.sourceforge.net> wrote:
> 
>> Can someone please take up this task and prove to me with benchmarks that 
>> this effort will be worthwhile
>> (or at least show me what I’m going to have to put into my documentation as 
>> to when these new APIs are appropriate) ? 
> 
> Some database software has supported the processing of prepared SQL statements
> through well-known application programming interfaces for decades, hasn't it?
> 
> Would you like to draw on any further experience regarding execution speed and
> program safety from software development history?

Below is a benchmark script, psycopg2 vs. pg8000.  The former uses no prepared 
statements whatsoever; the latter uses prepared statements for all executions.  
To make it even more “fair”, I will illustrate an executemany(), so that pg8000 
can definitely take advantage of “caching” this statement at the prepared 
statement level.

The output times, for INSERTing 50000 rows, using five batches of 10000 (so 
that pg8000 can re-use the same prepared statement 10000 times per batch), are:

        Total time for psycopg2: 4.992107
        Total time for pg8000: 10.770641

Now as it turns out, this test is entirely unfair.  psycopg2 is written in C 
and talks to libpq directly, whereas pg8000 is pure Python and implements 
PostgreSQL’s wire protocol itself.  There is no chance of pg8000 ever 
approaching the speed of psycopg2.

In my experience with execution speed in Python, enhancements like being able 
to cache prepared statements are dwarfed by the key issues in Python: function 
call overhead and code complexity.  In this case, choosing a DBAPI because one 
uses prepared statements and the other does not would not be the right 
approach.  It doesn’t matter that prepared statements are traditionally 
useful.  The complexity they would introduce, where programmers guess that 
using prepared statements, along with the statement-caching architecture 
necessary to surround their use, will make their program faster, will more 
often than not have a net negative effect.  Working with OpenStack developers, 
I regularly have to shoot down all kinds of changes made in the name of “will 
(theoretically) perform better”, yet in reality, when benchmarked, they perform 
worse, or hardly better at all, at the expense of greater complexity.  These 
theories are supported by all kinds of blanket statements such as “iterators 
are faster”, “using Core statements is faster than the ORM”, and “asynchronous 
code is faster than synchronous”, which at face value are true in some 
circumstances, but in many specific situations are no better from a performance 
standpoint and only complicate the code.
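
As a rough illustration of the scale involved (this snippet is not part of the 
benchmark below, and the helper names in it are made up for the example), the 
cost of a single extra layer of pure-Python function calls is easy to measure 
with timeit, and it is this kind of per-call overhead that tends to swamp 
micro-optimizations at the driver level:

import timeit

def _append(lst, value):
    # a do-nothing wrapper; one extra pure-Python call per item
    lst.append(value)

def with_extra_calls(n=100000):
    out = []
    for i in range(n):
        _append(out, i)
    return out

def without_extra_calls(n=100000):
    out = []
    append = out.append  # bind the method once, skip the wrapper
    for i in range(n):
        append(i)
    return out

print("with extra calls: %f" % timeit.timeit(with_extra_calls, number=50))
print("without extra calls: %f" % timeit.timeit(without_extra_calls, number=50))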

Basically, I want there to be a very good reason for this feature to become an 
explicit part of the DBAPI, because it is going to give me personally a lot of 
extra headaches when people start asking for it in the name of “performance”.  
In practice, psycopg2, pg8000 and perhaps mxODBC will be the only DBAPIs to add 
this feature; the MySQL DBAPIs are unlikely to be interested, and the pysqlite 
API almost certainly won’t be, as they have been stalling for years just on 
very rudimentary transactional issues that remain open.  A DBAPI can already 
choose to make use of cached prepared statements as an internal optimization, 
using an LRU cache that could be configurable via the connect() function, or 
just via the executemany() API as some do right now.
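
As a sketch of what that kind of internal optimization could look like (the 
class and method names here are invented for illustration and are not part of 
pep-249 or of any existing driver), a driver could key an LRU cache on the SQL 
string it is asked to execute:

import collections

class PreparedStatementCache(object):
    """Hypothetical LRU cache of prepared statements, keyed on SQL text."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self._cache = collections.OrderedDict()

    def get(self, sql, prepare_fn):
        # return the cached statement for this SQL, preparing it on a miss
        try:
            stmt = self._cache.pop(sql)
        except KeyError:
            stmt = prepare_fn(sql)
            if len(self._cache) >= self.capacity:
                self._cache.popitem(last=False)  # evict least recently used
        self._cache[sql] = stmt  # (re-)insert as most recently used
        return stmt

A cursor’s execute() could consult such a cache before preparing anew, and the 
capacity could be a keyword argument to connect(), all without any change to 
the pep-249 surface.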

In my work with consumers of the DBAPI, prepared statement support is not high 
on the list of needs (or on it at all).  What people are looking for are DBAPIs 
that handle unicode very well, support asynchronous APIs such as asyncio as 
well as implicit environments such as gevent, and provide robust datatype 
support along with very robust failure-mode handling (all areas that psycopg2 
does extremely well).  These are also the areas where pep-249 itself is IMO 
sorely lacking in guidance and could use more immediate attention first.


import random
import timeit
import psycopg2
import pg8000


def setup(conn):
    cursor = conn.cursor()
    cursor.execute("drop table if exists data")
    cursor.execute(
        "create table data (id SERIAL primary key, data VARCHAR(100))")
    cursor.close()


def run_test(conn):
    cursor = conn.cursor()
    cursor.executemany(
        "insert into data (data) values (%s)",
        [
            ("some value: %d" % random.randint(0, 10000),)
            for counter in xrange(10000)
        ]
    )
    cursor.close()


def do_time(dbapi):
    global conn
    conn = dbapi.connect(
        user='scott', password='tiger',
        database='test', host='localhost')

    time = timeit.timeit(
        "run_test(conn)",
        "from __main__ import run_test, setup, conn; setup(conn)",
        number=5
    )
    conn.close()
    return time

psycopg2_time = do_time(psycopg2)
pg8000_time = do_time(pg8000)

print("Total time for psycopg2: %f" % psycopg2_time)
print("Total time for pg8000: %f" % pg8000_time)

