On Oct 9, 2008, at 5:04 PM, Christoph Zwerschke wrote:

> Since there were no complaints, I have now checked that in:
> http://www.pygresql.org/cvsweb.cgi/pygresql/module/pg.py?view=log
>

Thanks,

It's a nice feature to have, and I finally got a chance to move my code
from test to production. Unfortunately, when I started to throw real-world
test data at it, it turned out to be a net performance hit rather than a
net performance gain. Before I had access to COPY, I had this:

      import pgdb

      # rstrip() must be called, not just referenced
      fooDataList = [l.rstrip() for l in open("foo_data.txt")]

      db = pgdb.connect()

      cur = db.cursor()

      cur.execute("CREATE TEMPORARY TABLE raw_foo(" +
                  "raw_foo_id serial not null primary key, foo_data  
text)")
      cur.execute("PREPARE raw_foo_insert(text) AS " +
                  "INSERT INTO raw_foo(foo_data) VALUES ($1)")
      cur.executemany("EXECUTE raw_foo_insert(%s)", fooDataList)
      cur.execute("DEALLOCATE raw_foo_insert")

      ...

      cur.close()
      db.commit()
      db.close()

This took about 0.5 to 0.6 seconds to load 1000 rows and about 84.5
seconds to process it, for a total run time of about 85 seconds.
With COPY I can do this:

      import pg
      import pgdb

      classicDb = pg.connect()

      db = pgdb.connect(cnx = classicDb)
      cur = db.cursor()

      cur.execute("CREATE TEMPORARY TABLE load_foo(foo_data text)")
      classicDb.query("COPY load_foo FROM stdin")

      for l in fooDataList:
          # putline() sends the line verbatim; COPY rows need a newline
          classicDb.putline(l + '\n')

      classicDb.putline('\\.\n')  # end-of-data marker
      classicDb.endcopy()

      cur.execute("CREATE TEMPORARY TABLE raw_foo(" +
                  "raw_foo_id serial not null primary key, foo_data  
text)")
      cur.execute("INSERT INTO raw_foo (foo_data) " +
                  "SELECT foo_data from load_foo")

      ...

      cur.close()
      db.commit()
      del db
      classicDb.close()

This loads 1000 rows in about 0.25 seconds, so if just loading data is
your gig then COPY is a win. But the processing after the load takes
89.25 seconds, for a total run time of about 90 seconds. Somewhere I
lost about 5 seconds. That's what I get for not analyzing the problem
before I began optimizing the code.

For 4.75 seconds I'm leaving it as COPY, because I can wrap up the
details in a class and basically add a LoadExternalTable() method that
takes a table name and a list of values (a sketch follows below). I
expect the performance benefit will really show when I build my new
PostgreSQL server on more up-to-date hardware. I'm also planning to move
temporary tables to a special tablespace on a memory filesystem rather
than on disk; I'm hoping that performs well, and I'll publish results
when I get them.
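
Here is a minimal sketch of the wrapper I have in mind; the class name
and the exact shape of LoadExternalTable() are just illustration, not
settled code:

      import pg
      import pgdb

      class CopyLoader:
          # Pair a classic pg connection with a DB-API pgdb connection
          # (via the cnx feature above) so bulk loads can go through
          # COPY while everything else uses the cursor interface.
          def __init__(self):
              self.classicDb = pg.connect()
              self.db = pgdb.connect(cnx=self.classicDb)

          def LoadExternalTable(self, tablename, values):
              # tablename is assumed trusted; it is interpolated
              # directly into the COPY statement.
              self.classicDb.query("COPY %s FROM stdin" % tablename)
              for v in values:
                  self.classicDb.putline(v + '\n')
              self.classicDb.putline('\\.\n')  # end-of-data marker
              self.classicDb.endcopy()

With that, the load step above collapses to
loader.LoadExternalTable("load_foo", fooDataList).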
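
For the memory-filesystem experiment, the setup I'm picturing is roughly
this; the tablespace name and mount point are placeholders, and the
temp_tablespaces setting needs PostgreSQL 8.3 or later:

      # One-time setup as superuser, with /mnt/pg_ram mounted as a
      # RAM-backed filesystem (e.g. tmpfs) owned by the postgres user:
      classicDb.query("CREATE TABLESPACE ramspace LOCATION '/mnt/pg_ram'")

      # Then, per session, point temporary tables at it:
      classicDb.query("SET temp_tablespaces = ramspace")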

-- Chris


Chris Hilton                tildeChris -- http://myblog.vindaloo.com
                                 email -- chris/at/vindaloo/dot/com
.--.~~.--.~~.--.~~.--.~~.--.~~.--.~~.--.~~.--.~~.--.~~.--.~~.--.~~.
          "I'm on the outside looking inside, What do I see?
           Much confusion, disillusion, all around me."
                               -- Ian McDonald / Peter Sinfield
