New to Postgres and am prototyping a migration from Oracle to Postgres 9.0.1 on 
Linux. Starting with the data warehouse. Current process is to load the data 
from
our OLTP (Oracle), dump it into another instance of Oracle for staging and 
manipulation, then extract it and load it into Infobright. I am trying
to replace the Oracle instance used for staging and manipulation with Postgres. 
Kettle (PDI), a Java ETL tool, is used for this process.

Came across a problem I find perplexing. I recreated the dimensional tables in 
Oracle and the fields that are integers in Oracle became integers
in Postgres. Was experiencing terrible performance during the load and narrowed 
down to a particular dimensional lookup problem. The table
dim_carrier holds about 80k rows. You can see the actual query issued by Kettle 
below, but basically I am looking up using the business key from
our OLTP system. This field is carrier_source_id and is indexed as you can see 
below. If I change this field from an integer to a real, I get
about a 70x increase in performance of the query. The EXPLAIN ANALYZE output is 
nearly identical, except for the casting of 1 to a real when the column
is a real. In real life, this query is actually bound and parameterized, but I 
wished to simplify things a bit here (and don't yet know how to EXPLAIN ANALYZE 
a parameterized
query). Now in terms of actual performance, the same query executed about 25k 
times takes 7 seconds with the real column, and 500 seconds with the integer 
column.

What gives here? Seems like integer (or serial) is a pretty common choice for 
primary key columns, and therefore what I'm experiencing must be an anomoly.



                Table "hits_olap.dim_carrier"
      Column       |            Type             | Modifiers
-------------------+-----------------------------+-----------
 carrier_id        | integer                     | not null
 dim_version       | smallint                    |
 dim_effect_date   | timestamp without time zone |
 dim_expire_date   | timestamp without time zone |
 carrier_source_id | integer                     |
 carrier_name      | character varying(30)       |
 carrier_type      | character varying(30)       |
 carrier_scac      | character varying(4)        |
 carrier_currency  | character varying(3)        |
 current_row       | smallint                    | default 0
Indexes:
    "dim_carrier_pk" PRIMARY KEY, btree (carrier_id)
    "idx_dim_carrier_lookup" btree (carrier_source_id)

VACUUM
ANALYZE
REINDEX

 EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER 
WHERE CARRIER_SOURCE_ID = '1'  AND now() >= DIM_EFFECT_DATE
 AND now() < DIM_EXPIRE_DATE;

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_dim_carrier_lookup on dim_carrier  (cost=0.00..12.10 
rows=2 width=6) (actual time=0.076..0.077 rows=1 loops=1)
   Index Cond: (carrier_source_id = 1)
   Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date))  Total 
runtime: 0.108 ms
(4 rows)

ALTER TABLE
ALTER TABLE
                Table "hits_olap.dim_carrier"
      Column       |            Type             | Modifiers
-------------------+-----------------------------+-----------
 carrier_id        | integer                     | not null
 dim_version       | smallint                    |
 dim_effect_date   | timestamp without time zone |
 dim_expire_date   | timestamp without time zone |
 carrier_source_id | real                        |
 carrier_name      | character varying(30)       |
 carrier_type      | character varying(30)       |
 carrier_scac      | character varying(4)        |
 carrier_currency  | character varying(3)        |
 current_row       | smallint                    | default 0
Indexes:
    "dim_carrier_pk" PRIMARY KEY, btree (carrier_id)
    "idx_dim_carrier_lookup" btree (carrier_source_id)

VACUUM
ANALYZE
REINDEX

 EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER 
WHERE CARRIER_SOURCE_ID = '1'  AND now() >= DIM_EFFECT_DATE
 AND now() < DIM_EXPIRE_DATE;

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_dim_carrier_lookup on dim_carrier  (cost=0.00..12.10 
rows=2 width=6) (actual time=0.068..0.069 rows=1 loops=1)
   Index Cond: (carrier_source_id = 1::real)
   Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date))  Total 
runtime: 0.097 ms
(4 rows)



Thanks for the help,

Dave Greco

Reply via email to