On Wed, Jun 6, 2012 at 4:03 AM, Vlad K. <[email protected]> wrote:
> On 06/06/2012 07:49 AM, Andi wrote:
>>
>> just a benchmark, but better than nothing. found that during our research.
>> http://blog.curiasolutions.com/the-great-web-framework-shootout/
>>
>> andi
>>
>> (sent right out of my head)
>>
>
>
> The Curiasolutions shootout is interesting. However, even for a synthetic
> benchmark it is highly unbalanced. For example, it shows Pyramid yielding
> more rps than Bottle on Hello World. But then throws Pyramid way lower than
> Bottle on templated db task.
>
> If you take a look at the benchmark code, you'll notice:
>
> - both use SQLite with a local file db, which is ok
> - Bottle uses SQLite driver directly
> - Pyramid uses SQLAlchemy which incurs significant overhead
>
> You can use database drivers directly in Pyramid. You don't need sqlalchemy,
> or transaction extensions, they are not required by Pyramid, just a chosen
> default.

SQLAlchemy's overhead is not necessarily significant, and there are
ways to use SQLAlchemy to minimize the overhead. That comes at the
expense of convenience, of course, so you'd want to do a side-by-side
comparison of a typical task to see how much the overhead matters.

The biggest performance improvement is when you do bulk queries in the
database. One SQL statement that does some calculations and returns a
small number of result records. That avoids the overhead of loading
every record into some Python data type. You can also do bulk updates
and deletes.  Optimized bulk processing at the C level is a feature of
all SQL-compatible databases. It often is not a feature of non-SQL
databases. For instance, CouchDB can perform a Javascript query and
return some records, but I don't know if it's any faster than going
through all the records in Python.
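To make the bulk-query point concrete, here's a minimal sketch using
plain sqlite3 from the standard library (the table and numbers are
invented for the example): summing a column in one SQL statement versus
fetching every row into Python, plus a one-statement bulk update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (page TEXT, count INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [("home", 10), ("about", 3), ("home", 5)])

# Row-by-row: fetch every record into a Python object and sum in Python.
total_py = sum(count for (count,) in conn.execute("SELECT count FROM hits"))

# Bulk query: one SQL statement, one small result record back.
(total_sql,) = conn.execute("SELECT SUM(count) FROM hits").fetchone()

# Bulk update: one statement instead of a Python loop over rows.
conn.execute("UPDATE hits SET count = count * 2")
```

Both totals come out the same, but on a large table the bulk form does
all the per-row work in C inside the database engine.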

The second-largest optimization is to use SQLAlchemy's SQL builder
level rather than the ORM level. You can send it a SQL string to
execute, or use the builder methods to construct the SQL statement.
This essentially runs at the same speed as raw DB-API because it's
doing the same thing. The result is an iterable of RowProxy objects,
which incur some minimal overhead. If you access fields by key rather
than by attribute (x[0] or x["foo"] vs x.foo), it's supposed to be faster.
RowProxy uses lazy evaluation, so it avoids processing the underlying
row tuple except as necessary.
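Here's a rough sketch of both styles at the SQL level, assuming
SQLAlchemy 1.4 or later (where RowProxy has been renamed Row and
select() takes columns directly; the table and data are made up for
the example):

```python
from sqlalchemy import (create_engine, text, MetaData, Table,
                        Column, Integer, String, select)

engine = create_engine("sqlite://")  # in-memory SQLite for the demo
metadata = MetaData()
pages = Table("pages", metadata,
              Column("id", Integer, primary_key=True),
              Column("title", String))
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(pages.insert(), [{"title": "Home"}, {"title": "About"}])

with engine.connect() as conn:
    # A raw SQL string, passed essentially straight through to DB-API.
    raw = conn.execute(text("SELECT title FROM pages ORDER BY id")).fetchall()

    # The same statement via the SQL builder; no ORM objects involved.
    built = conn.execute(select(pages.c.title).order_by(pages.c.id)).fetchall()

first = built[0]
# Rows allow positional access (first[0]) and attribute access (first.title).
```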

The ORM has to instantiate a Python object for every record, and keep
track of which objects have had their attributes changed in Python.
But again it does lazy evaluation, so it's not like it sets every
attribute on initialization.  Recent versions of SQLAlchemy also have
a feature that you can construct a query using the ORM methods, but if
it's a query on certain fields rather than on an entire ORM object, it
returns RowProxy objects just like the SQL level, so it bypasses most of the
ORM's overhead.
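A sketch of that column-only ORM query, assuming SQLAlchemy 1.4+ with
the declarative ORM (the User model and data are invented for the
example):

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([User(name="alice"), User(name="bob")])
    session.commit()

    # Full ORM query: one change-tracked User instance per row.
    users = session.query(User).all()

    # Column-only query built with the same ORM attributes: lightweight
    # row tuples come back, bypassing most of the ORM's bookkeeping.
    names = session.query(User.id, User.name).order_by(User.id).all()
```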

But again, you should do a side-by-side comparison to see how much
the overhead actually is, because sometimes it surprises you. I have
an import routine that reads 10,000+ records from CSV files and puts
them in an empty database, and it takes 30 seconds to run either with
or without the ORM. On the other hand, I have some log-processing
scripts that process hundreds of thousands or millions of records, and
the speedup is significant if I switch from ORM-level processing to
SQL-level processing. (But again, I can use the ORM methods to
construct the queries; I just avoid returning ORM objects or inserting
ORM objects.)

The third thing you can do is to hold a long-lived connection
throughout the application, rather than letting the engine check out a
connection on every query. That avoids the overhead of the connection
pool. But that probably makes little difference. The purpose of the
connection pool is to avoid the larger overhead of actually connecting
to the database on every query. That's slow on some databases like
PostgreSQL, but fast on others like SQLite. So the pool actually
improves performance, and raw DB-API does not have a connection pool.
This again depends on your application. A short-lived, single-threaded
utility can just hold a connection for simplicity. But a multithreaded
web application really benefits from a pool, so that you don't have to
manage connections. (Or at most, you hold a connection open for a
single function or single request.)
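The two patterns look roughly like this, assuming a recent SQLAlchemy
(table and query invented for the example):

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # the pool lives inside the engine

# Pattern 1: let the engine manage connections. Each block checks a
# connection out of the pool and returns it on exit (cheap, not free).
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE t (x INTEGER)"))
    conn.execute(text("INSERT INTO t VALUES (1), (2)"))

# Pattern 2: a short-lived, single-threaded utility holds one
# connection for its whole run, skipping the per-query pool checkout.
conn = engine.connect()
try:
    total = conn.execute(text("SELECT SUM(x) FROM t")).scalar()
finally:
    conn.close()
```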

SQLAlchemy is wonderful because it's multi-level: you can give it SQL
strings, use the SQL builder, or use the ORM, depending on the
application or even in different places in the same application.
Python never had a multi-level database library before SQLAlchemy. I
don't know how common it is in other programming languages. Michael
Bayer also writes excellent documentation. (He wrote SQLAlchemy and
Mako.) So I would definitely recommend using it.

The other issue is that raw benchmarks of the form "framework X does N
requests per hour" are usually unrealistic. They use an empty application to measure
the framework's overhead. But in the real world of complex
applications using databases and performing calculations, the overhead
of the framework is dwarfed by the overhead of the application code.
If you have a very small, simple application like Twitter with a huge
number of requests, then the framework's performance would be close to
the benchmark. Otherwise it will degrade in ways that aren't framework
specific. (I.e., they'd be the same in Pyramid or Flask.) If your
application is so high-volume that it approaches the hardware's
capacity, then you should look at parallel servers as well as
different frameworks or languages. It may be that the cost of a second
server is less than the programming-time cost or inconvenience of
using a "simple, streamlined" framework or C-like language. Unless the
application is very simple, in which case a minimalistic framework may
be perfect for it.

In terms of Python WSGI applications, there are two separate
overheads: the WSGI server, and the framework/application. The
CherryPy server is considered the most robust at high loads, compared
to other multithreaded Python servers. You can use it with Pyramid;
just set the "[server:main]" section in the INI file. Asynchronous
servers may have higher performance than multithreaded ones, but the
difficulty of making an application asynchronous-safe may outweigh the
advantages. You can also use a module like mod_wsgi to avoid the
overhead of a separate WSGI HTTP server.
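For example, the server section of a Pyramid INI file might look
something like this. The egg name here is the Pylons-era PasteScript
entry point for CherryPy's WSGI server; the option names beyond host
and port are illustrative, so check the entry points and options your
installed packages actually provide:

```ini
[server:main]
# CherryPy's WSGI server via PasteScript's server_runner entry point
# (exact egg name depends on what you have installed).
use = egg:PasteScript#cherrypy
host = 0.0.0.0
port = 6543
```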

-- 
Mike Orr <[email protected]>
