Hi, hackers!

I've adapted crossmatch join from pgSphere to cube for performance tests.
I've placed spatial join code here
https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/spatialjoin.c
and node code here
https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/joinnode.c

If you have an idea of improving the performance of this code, please
do not hesitate to express them.
One of the performance bottlenecks is code nearby heap_getattr() in
fetch_next_pair().

==Performance Tests==
I've tested performance on queries which aggregate result of the
spatial join. See cube_test.sql attached.

On 3d data, Spatial Join performs 3x faster than Nested Loop Join + GiST Scan

Nested Loop + Index Scan plan:
 HashAggregate  (cost=36841568.00..36841570.00 rows=200 width=40)
(actual time=206565.869..206738.307 rows=298731 loops=1)
   Group Key: r.nx
   ->  Nested Loop  (cost=0.41..25591568.00 rows=900000000 width=40)
(actual time=0.357..200838.416 rows=8464147 loops=1)
         ->  Seq Scan on regions r  (cost=0.00..6410.00 rows=300000
width=40) (actual time=0.015..324.436 rows=300000 loops=1)
         ->  Index Scan using idx on datatable a  (cost=0.41..55.28
rows=3000 width=64) (actual time=0.174..0.648 rows=28 loops=300000)
               Index Cond: (r.c @> c)
 Planning time: 17.175 ms
 Execution time: 206806.926 ms

Time: 206852,635 ms (03:26,853)

Spatial Join plan:
HashAggregate  (cost=56250001.00..56250003.00 rows=200 width=40)
(actual time=67373.644..67553.118 rows=298731 loops=1)
   Group Key: r.nx
   ->  Custom Scan (SpatialJoin)  (cost=0.00..1.00 rows=4500000000
width=40) (actual time=0.151..61718.804 rows=8464147 loops=1)
         Outer index: idx
         Inner index: idx1
 Planning time: 0.182 ms
 Execution time: 67630.742 ms

Time: 67631,557 ms (01:07,632)

But on more realistic 7D data with queries emulating OLAP system
performance of Spatial Join is 2 times worse than Nested Loop Join +
GiST Scan. Which comes as a complete surprise to me.
I do not see any algorithmic reason for Spatial Join to be slower.
Thus I strongly suspect that my implementation is not efficient, but
as for now I have no ideas how to improve it.

Here are plans for 7D
Nested Loop + Index Scan
 HashAggregate  (cost=3425143.00..3425743.00 rows=60000 width=72)
(actual time=122794.715..122822.893 rows=60000 loops=1)
   Group Key: r.nx
   ->  Nested Loop  (cost=0.41..2075143.00 rows=60000000 width=72)
(actual time=0.311..100478.710 rows=39817008 loops=1)
         ->  Seq Scan on r1 r  (cost=0.00..2419.00 rows=60000
width=128) (actual time=0.043..60.579 rows=60000 loops=1)
         ->  Index Scan using ix_a1_cube on a1 a  (cost=0.41..24.55
rows=1000 width=128) (actual time=0.110..1.266 rows=664 loops=60000)
               Index Cond: (c <@ r.c)
 Planning time: 0.349 ms
 Execution time: 122831.353 ms
(8 rows)

Spatial Join
 HashAggregate  (cost=6750001.00..6750601.00 rows=60000 width=72)
(actual time=241832.855..241889.360 rows=60000 loops=1)
   Group Key: r.nx
   ->  Custom Scan (SpatialJoin)  (cost=0.00..1.00 rows=300000000
width=72) (actual time=0.140..216187.111 rows=39817008 loops=1)
         Outer index: ix_r1_cube
         Inner index: ix_a1_cube
 Planning time: 0.533 ms
 Execution time: 241907.569 ms
(7 rows)

Time: 241910,440 ms (04:01,910)

Any ideas will be highly appreciated.

Best regards, Andrey Borodin.
create extension if not exists cube;

begin transaction;
SELECT setseed(.43);

\timing

create table dataTable as select cube(array[random(),random(),random()]) c, 
1::float as x1, 1::float as x2, 1::float as x3, 1::float as x4 from 
generate_series(1,3e6,1) s;
create index idx on dataTable using gist(c);

create table regions as select cube(array[x,y,z],array[x+0.1,y+0.01,z+0.01]) c, 
row_number() over() as nx from (select random() x,random() y, random() z from 
generate_series(1,3e5,1) s) q;
create index idx1 on regions using gist(c);

set max_parallel_workers_per_gather = 0;

explain analyze select 
        sum(a.x1) as x1, sum(a.x2) as x2, sum(a.x3) as x3, sum(a.x4) as x4
from dataTable a
join regions r
    on 
        r.c @> a.c
group by r.nx;

explain analyze select 
        sum(a.x1) as x1, sum(a.x2) as x2, sum(a.x3) as x3, sum(a.x4) as x4
from dataTable a
join regions r
    on 
        r.c && a.c
group by r.nx;

commit;
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to