On 6/29/05, Rudi Starcevic <[EMAIL PROTECTED]> wrote:
> >I do my batch processing daily using a Python script I've written. I
> >found that trying to do it with PL/pgSQL took more than 24 hours to
> >process 24 hours' worth of logs. I then used C# and in-memory hash
> >tables to drop the time to 2 hours, but I couldn't get Mono installed
> >on some of my older servers. Python proved the fastest and I can
> >process 24 hours' worth of logs in about 15 minutes. Common reports run
> >in < 1 sec and custom reports run in < 15 seconds (usually).
> When you say you do your batch processing in a Python script, do you mean
> you are using 'plpython' inside
> PostgreSQL or using Python to execute select statements and crunch the
> data 'outside' PostgreSQL?
> Your reply is very interesting.
Sorry for not making that clear... I don't use plpython. I'm using an
external Python program that makes database connections, creates
dictionaries and does the normalization/batch processing in memory. It
then saves the changes to a text file, which is loaded into the database
using psql.
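A minimal sketch of that kind of in-memory normalization, for illustration only. The column layout (timestamp, user agent), the function names and the choice of CSV output are all assumptions, not the actual script:

```python
import csv
import io


def normalize(rows, known_ids):
    """Replace each user-agent string with a small integer id.

    rows      -- iterable of (timestamp, user_agent) tuples (hypothetical layout)
    known_ids -- dict of user_agent -> id, preloaded from the lookup table
    Returns (new_lookup_rows, fact_rows).
    """
    next_id = max(known_ids.values(), default=0) + 1
    new_lookup_rows = []  # (id, user_agent) rows not yet in the database
    fact_rows = []        # (timestamp, id) rows, already normalized
    for timestamp, user_agent in rows:
        ua_id = known_ids.get(user_agent)
        if ua_id is None:
            # First time this value is seen: assign the next free id.
            ua_id = next_id
            next_id += 1
            known_ids[user_agent] = ua_id
            new_lookup_rows.append((ua_id, user_agent))
        fact_rows.append((timestamp, ua_id))
    return new_lookup_rows, fact_rows


def write_copy_file(fileobj, rows):
    """Write rows as CSV so they can be bulk loaded in a single pass."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerows(rows)
```

A file written this way could then be loaded from psql with something like `\copy hits from 'hits.csv' csv`, where `hits` is a made-up table name. The point is that every lookup is a dictionary access in RAM, and the database only sees one bulk load at the end.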
I've tried many things and while this is RAM intensive, it is by far
the fastest approach I've found. I've also modified the Python program
to optionally use disk-based dictionaries based on (I think) gdbm. This
increases the run time significantly, to closer to 25 min. ;-) but drops
the memory usage by an order of magnitude.
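A rough sketch of the disk-based variant, assuming Python 3's standard `dbm` module (which picks whichever backend is available, gdbm being one of them). The helper names and the reserved counter key are hypothetical, not taken from the real program:

```python
import dbm
import os
import tempfile

# Reserved key holding the next free id. Assumption: no real value uses it.
COUNTER_KEY = b"__next_id__"


def get_or_assign_id(db, value):
    """Look up value in the on-disk mapping, assigning a new id if missing."""
    key = value.encode("utf-8")
    if key in db:
        return int(db[key].decode("ascii"))
    next_id = int(db.get(COUNTER_KEY, b"1").decode("ascii"))
    db[key] = str(next_id).encode("ascii")
    db[COUNTER_KEY] = str(next_id + 1).encode("ascii")
    return next_id


def demo():
    """Assign ids across two separate opens to show they persist on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "agents")
        with dbm.open(path, "c") as db:
            first = [get_or_assign_id(db, v) for v in ("Mozilla", "Opera", "Mozilla")]
        with dbm.open(path, "c") as db:
            second = [get_or_assign_id(db, v) for v in ("Opera", "Lynx")]
    return first, second
```

Each lookup now goes through the database file instead of a Python dict, which is the trade-off described above: slower per lookup, but only a small part of the mapping has to live in RAM at once.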
To be fair to C# and .NET, I think that Python and C# can do it
equally fast, but between the time of creating the C# version and the
Python version I learned some new optimization techniques. I feel that
both are powerful languages. (To be fair to Python, I can write the
dictionary lookup code in approx. 25% fewer lines than similar hash
table code in C#. I could go on but I think I'm starting to get off