Re: [PERFORM] ETL optimization

Jacques Caron Thu, 23 Jun 2005 15:06:25 -0700

Hi,

At 21:38 23/06/2005, Bricklen Anderson wrote:

Situation:

I'm trying to optimize an ETL process with many upserts (~100k aggregatedrows)

(no duplicates allowed). The source (table t2) table holds around 14 million
rows, and I'm grabbing them 100,000 rows at a time from t2, resulting in about
100,000 distinct rows in the destination table (t1).



What I've tried:

i. FOR EXECUTE LOOP over my result set (aggregated results, 100k-ishrows), and

try an update first, check the ROW_COUNT, if 0, then do an insert.
...
run time: approx. 25 mins


ii. in a function (pseudo code), (table name is dynamic):
...
up_stm :=
'UPDATE '||t1||' SET x=t2.x
FROM    (select sum(x),a,b,c
        from t2
        group by a,b,c) as t2
WHERE '||t1||'.a=t2.a AND '||t1||'.b=t2.b AND '||t1||'.c=t3.c';

EXECUTE up_stm;

ins_stm :=
'INSERT INTO '||t1||' (x,a,b,c) select x,a,b,c
FROM (select sum(x) as x,a,b,c from t2 group by a,b,c) as t2
WHERE NOT EXISTS
        (select true from '||t1||'
        where '||t1||'.a=t2.a
        and '||t1||'.b=t2.b
        and '||t1||'.c=t2.c
        limit 1)';

EXECUTE ins_stm;
...

I have a similar situation, and the solution I use (though I haven't reallytested many different situations):

- have a trigger ON INSERT which does:

UPDATE set whatever_value=NEW.whatever_value,... WHEREwhatever_key=NEW.whatever.key AND...

IF FOUND THEN
 RETURN NULL;
ELSE
 RETURN NEW;
END IF;
- use COPY

For optimal performance, a different trigger function is created for eachtable, which allows the query plan of the UPDATE to be cached.


Let us know how that works out for you and if you find a better solution!

Jacques.



---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
   (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])

Re: [PERFORM] ETL optimization

Reply via email to