Hello guys.

 

My issue touches multiple topics, but the main question is about
performance. I think you need a little background to be able to help me,
so I'll first define the problem and my solution to it, and put my
questions at the end of this message.

 

Problem:

I have a table of observations of objects on the sky. The most important
columns are the coordinates (x, y); all other columns are just additional
information about the observation. The problem is that when I take an image
of the same object on the sky twice, the coordinates x, y won't be
identical, only close to each other. My task is to generate a common
identifier for all observations of the same object and assign the
observations to it (an N:1 relation). The criterion is that all observations
within 1 arcsec of this identifier are considered the same object. I keep
the identifiers in a separate table (objcat) and have a foreign key to it in
the observations table.

The reason I'm working on performance here is that the observations table
currently has about 3e8 rows after 1.5 years of gathering data, and the
growth is linear.

 

Technical background:

I'm trying to keep the whole algorithm in the DB if possible, because I have
a good PostgreSQL extension, Q3C, for indexing the coordinates of objects on
the sky (https://code.google.com/p/q3c/source/browse/README.q3c). It also
provides quite a few stored procedures for finding near neighbors in such a
table, which is what I'm doing. The function q3c_join(x1, y1, x2, y2,
radius) returns true if object 2 is within radius of object 1. Internally it
expands into bitmap index scans, or range conditions with the <= and >=
operators, on the sky position, so nearest-neighbor lookups are just index
scans.
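As a concrete illustration (a sketch only; the literal coordinates are made
up, and `\schema` stands for the real schema name as in the procedure
below), a nearest-neighbor lookup against the identifier table looks like:

```sql
-- Find all catalog identifiers within 1 arcsec (1/3600 deg) of a given
-- observation; q3c_join expands into range conditions usable by the Q3C index.
SELECT cat.id_cat, cat.raj2000, cat.dej2000
FROM \schema.objcat AS cat
WHERE q3c_join(215.123, 52.456,        -- coordinates of the observation (deg)
               cat.raj2000, cat.dej2000,
               1.0 / 3600.0);          -- radius: 1 arcsec in degrees
```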

 

Solution:

After a lot of experimentation with clustering the objects and trying to
process them all together in one "silver-bullet" SQL query, I decided to use
a simpler approach. The main problem with the "process all at once" approach
is that finding the neighbors of every observation is by definition
quadratic, and for 3e8 rows it simply runs out of disk space (~TBs for the
temporary results).
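For reference, the abandoned all-at-once approach boils down to a self-join
of this shape (a sketch with assumed column names; this is the intermediate
result that blows up to ~TBs):

```sql
-- Pair every observation with every other observation within 1 arcsec:
-- the result set grows roughly quadratically with cluster density.
SELECT a.oid AS oid_a, b.oid AS oid_b
FROM \schema.observation AS a
JOIN \schema.observation AS b
  ON q3c_join(a.raj2000, a.dej2000,
              b.raj2000, b.dej2000,
              1.0 / 3600.0)
WHERE a.oid < b.oid;   -- skip self-pairs and duplicates
```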

The simplest approach I could think of is to process each of the 3e8 rows
sequentially and ask:

Do I already have an identifier within a radius of 1 arcsec?

No: generate one and assign the observation to it.

Yes: update it and assign the observation to it. The update is a weighted
average - I keep a count (weight) of how many observations the identifier
has been computed from. The result is that the identifier carries the
average coordinates of all the observations it identifies - it is their
center.
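The incremental update described above is just a running mean; a tiny
self-contained check (made-up numbers, plain SQL):

```sql
-- Folding a new coordinate into a mean of `weight` previous observations:
-- new_mean = (old_mean * weight + x_new) / (weight + 1)
SELECT (raj2000_cat * weight + 215.1234) / (weight + 1) AS new_raj2000
FROM (VALUES (215.1230::double precision, 3)) AS t(raj2000_cat, weight);
-- ≈ 215.1231
```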

 

So here is my procedure. It has three parameters. The first two give the
range of oids to process in the table, used for scaling and parallelizing
the algorithm. The third is the radius within which to search for
neighbors.

 

DROP TYPE IF EXISTS coords;
CREATE TYPE coords AS (
    raj2000 double precision,
    dej2000 double precision
);

DROP FUNCTION IF EXISTS build_catalog(int, int, double precision);
CREATE OR REPLACE FUNCTION build_catalog(fromOID int, toOID int,
                                         radius double precision)
    RETURNS VOID AS $$
DECLARE
    cur1 CURSOR FOR
        SELECT raj2000, dej2000
        FROM \schema.observation AS obs
        WHERE obs.oid >= fromOID
          AND obs.oid < toOID;
    curr_raj2000 double precision;
    curr_dej2000 double precision;
    curr_coords_cat coords;
    cnt int;
BEGIN
    /*SELECT current_setting('transaction_isolation') into tmp;
    raise notice 'Isolation level %', tmp;*/
    OPEN cur1;
    cnt := 0;
    LOCK TABLE \schema.objcat IN SHARE ROW EXCLUSIVE MODE;
    LOOP
        FETCH cur1 INTO curr_raj2000, curr_dej2000;
        EXIT WHEN NOT found;

        WITH upsert AS (
            -- Move an existing identifier toward the new observation
            -- (weighted running mean) and bump its weight.
            UPDATE \schema.objcat
            SET ipix_cat = q3c_ang2ipix(
                    (raj2000 * weight + curr_raj2000) / (weight + 1),
                    (dej2000 * weight + curr_dej2000) / (weight + 1)),
                raj2000 = (raj2000 * weight + curr_raj2000) / (weight + 1),
                dej2000 = (dej2000 * weight + curr_dej2000) / (weight + 1),
                weight = weight + 1
            WHERE q3c_join(curr_raj2000, curr_dej2000,
                           raj2000, dej2000,
                           radius)
            RETURNING *),
        ins AS (
            -- No identifier within the radius: create a new one.
            INSERT INTO \schema.objcat
                (ipix_cat, raj2000, dej2000, weight)
            SELECT q3c_ang2ipix(curr_raj2000, curr_dej2000),
                   curr_raj2000,
                   curr_dej2000,
                   1
            WHERE NOT EXISTS (SELECT * FROM upsert)
            RETURNING *)
        -- Point the observation at the updated or freshly inserted identifier.
        UPDATE \schema.observation
        SET id_cat = (SELECT DISTINCT id_cat FROM upsert
                      UNION
                      SELECT id_cat FROM ins
                      WHERE id_cat IS NOT NULL
                      LIMIT 1)
        WHERE CURRENT OF cur1;

        cnt := cnt + 1;

        IF ((cnt % 100000) = 0) THEN
            RAISE NOTICE 'Processed % entries', cnt;
        END IF;
    END LOOP;
    CLOSE cur1;
END;
$$ LANGUAGE plpgsql;

 

Results: When I run the procedure with only one client thread, it processes
about 1 million rows per hour, which means days for the whole dataset. When
I run it in parallel with the lock in place to ensure pessimistic
synchronization, it effectively runs sequentially too :) - the other threads
just wait. When I drop the lock and hope to resolve the resulting conflicts
later, the SSD serves up to 4 threads relatively effectively - which divides
my days of runtime by 4 - still unacceptable.

 

The reason seems quite clear: in each cycle of the loop I write something to
a table, and in the following cycle I need to read that information back.

 

Questions for you:

1. Can you think of a better way to do this, and is SQL even capable of
such a thing - or do I have to do it in C? Would rewriting the SQL function
in C help?

2. Could I somehow bend the committing during the algorithm to my purpose?
E.g. ensure that within one cycle the relevant part of the identifiers table
is kept in memory for faster lookups?

3. Why is this so slow? :) It is comparable to the quadratic algorithm in
terms of speed - it just doesn't use any memory.

 

I tried to sum this up as best I could - for more information, please don't
hesitate to contact me.

 

Thank you very much for even reading this far.

 

Best Regards,

 

Jiri Nadvornik

 

 
