Hi, hackers!
I've adapted the crossmatch join from pgSphere to the cube extension for performance tests.
The spatial join code is here:
https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/spatialjoin.c
and the custom scan node code here:
https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/joinnode.c
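For context, the core idea of the crossmatch join is to traverse both GiST indexes simultaneously, pairing up subtrees whose bounding boxes overlap, so that one overlap check can prune two whole subtrees at once. Below is a minimal self-contained sketch of that idea; the Box/Node types and all names are illustrative, not the actual structures from spatialjoin.c:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative in-memory stand-ins for GiST pages/entries. */
typedef struct Box { double lo[3], hi[3]; } Box;

typedef struct Node
{
    Box          bbox;          /* bounding box of the whole subtree */
    bool         is_leaf;      /* leaf entries point at heap tuples */
    int          nchildren;
    struct Node *children[8];
    int          tuple_id;     /* valid only for leaves */
} Node;

static bool
boxes_overlap(const Box *a, const Box *b)
{
    int d;

    for (d = 0; d < 3; d++)
        if (a->hi[d] < b->lo[d] || b->hi[d] < a->lo[d])
            return false;
    return true;
}

/*
 * Pair up overlapping subtrees of the two indexes.  Disjoint
 * subtrees are discarded wholesale, which is where the win over
 * a per-outer-row inner index scan comes from.
 */
void
dual_tree_join(const Node *outer, const Node *inner)
{
    int i;

    if (!boxes_overlap(&outer->bbox, &inner->bbox))
        return;

    if (outer->is_leaf && inner->is_leaf)
    {
        printf("pair: %d %d\n", outer->tuple_id, inner->tuple_id);
        return;
    }

    /* Descend the non-leaf side; the real code keeps an explicit
     * stack of page pairs rather than recursing. */
    if (!outer->is_leaf)
        for (i = 0; i < outer->nchildren; i++)
            dual_tree_join(outer->children[i], inner);
    else
        for (i = 0; i < inner->nchildren; i++)
            dual_tree_join(outer, inner->children[i]);
}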
If you have any ideas for improving the performance of this code, please
do not hesitate to share them.
One of the performance bottlenecks is the code around heap_getattr() in
fetch_next_pair().
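One guess about what may be going on there (an assumption, I have not verified it by profiling): if fetch_next_pair() pulls several attributes out of the same tuple, each heap_getattr() call may have to re-walk the tuple from the start, since cube is a varlena type and attribute offsets after it cannot be cached, while a single heap_deform_tuple() pass extracts everything at once. A sketch; the two fetch_attrs_* wrappers are hypothetical, only heap_getattr() and heap_deform_tuple() are real PostgreSQL calls:

#include "postgres.h"
#include "access/htup_details.h"

/*
 * Per-attribute fetching: each heap_getattr() may walk the tuple
 * from the beginning when an earlier attribute is varlena (as cube
 * is), so fetching n attributes can cost O(n^2).
 */
static void
fetch_attrs_one_by_one(HeapTuple tuple, TupleDesc tupdesc,
                       Datum *values, bool *isnull)
{
    int attnum;

    for (attnum = 1; attnum <= tupdesc->natts; attnum++)
        values[attnum - 1] = heap_getattr(tuple, attnum, tupdesc,
                                          &isnull[attnum - 1]);
}

/*
 * Single-pass alternative: heap_deform_tuple() extracts every
 * attribute in one walk over the tuple.
 */
static void
fetch_attrs_at_once(HeapTuple tuple, TupleDesc tupdesc,
                    Datum *values, bool *isnull)
{
    heap_deform_tuple(tuple, tupdesc, values, isnull);
}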
==Performance Tests==
I've tested performance on queries that aggregate the result of the
spatial join. See the attached cube_test.sql.
On 3D data, the Spatial Join performs about 3x faster than a Nested Loop Join + GiST Scan.
Nested Loop + Index Scan plan:
HashAggregate  (cost=36841568.00..36841570.00 rows=200 width=40) (actual time=206565.869..206738.307 rows=298731 loops=1)
  Group Key: r.nx
  ->  Nested Loop  (cost=0.41..25591568.00 rows=900000000 width=40) (actual time=0.357..200838.416 rows=8464147 loops=1)
        ->  Seq Scan on regions r  (cost=0.00..6410.00 rows=300000 width=40) (actual time=0.015..324.436 rows=300000 loops=1)
        ->  Index Scan using idx on datatable a  (cost=0.41..55.28 rows=3000 width=64) (actual time=0.174..0.648 rows=28 loops=300000)
              Index Cond: (r.c @> c)
Planning time: 17.175 ms
Execution time: 206806.926 ms
Time: 206852,635 ms (03:26,853)
Spatial Join plan:
HashAggregate  (cost=56250001.00..56250003.00 rows=200 width=40) (actual time=67373.644..67553.118 rows=298731 loops=1)
  Group Key: r.nx
  ->  Custom Scan (SpatialJoin)  (cost=0.00..1.00 rows=4500000000 width=40) (actual time=0.151..61718.804 rows=8464147 loops=1)
        Outer index: idx
        Inner index: idx1
Planning time: 0.182 ms
Execution time: 67630.742 ms
Time: 67631,557 ms (01:07,632)
But on more realistic 7D data, with queries emulating an OLAP system,
the Spatial Join performs about 2x worse than the Nested Loop Join +
GiST Scan, which came as a complete surprise to me.
I do not see any algorithmic reason for the Spatial Join to be slower,
so I strongly suspect that my implementation is inefficient, but for
now I have no ideas on how to improve it. (Judging by the plans below,
the Custom Scan spends roughly 216187 ms / 39.8M rows ≈ 5.4 µs per
emitted pair versus ≈ 2.5 µs for the Nested Loop, which suggests
per-pair overhead, e.g. the tuple fetching mentioned above, rather
than the traversal itself.)
Here are the plans for 7D.
Nested Loop + Index Scan plan:
HashAggregate  (cost=3425143.00..3425743.00 rows=60000 width=72) (actual time=122794.715..122822.893 rows=60000 loops=1)
  Group Key: r.nx
  ->  Nested Loop  (cost=0.41..2075143.00 rows=60000000 width=72) (actual time=0.311..100478.710 rows=39817008 loops=1)
        ->  Seq Scan on r1 r  (cost=0.00..2419.00 rows=60000 width=128) (actual time=0.043..60.579 rows=60000 loops=1)
        ->  Index Scan using ix_a1_cube on a1 a  (cost=0.41..24.55 rows=1000 width=128) (actual time=0.110..1.266 rows=664 loops=60000)
              Index Cond: (c <@ r.c)
Planning time: 0.349 ms
Execution time: 122831.353 ms
(8 rows)
Spatial Join plan:
HashAggregate  (cost=6750001.00..6750601.00 rows=60000 width=72) (actual time=241832.855..241889.360 rows=60000 loops=1)
  Group Key: r.nx
  ->  Custom Scan (SpatialJoin)  (cost=0.00..1.00 rows=300000000 width=72) (actual time=0.140..216187.111 rows=39817008 loops=1)
        Outer index: ix_r1_cube
        Inner index: ix_a1_cube
Planning time: 0.533 ms
Execution time: 241907.569 ms
(7 rows)
Time: 241910,440 ms (04:01,910)
Any ideas would be highly appreciated.
Best regards, Andrey Borodin.
==cube_test.sql==
create extension if not exists cube;
begin transaction;
SELECT setseed(.43);
\timing

-- 3M random 3D points with four measure columns to aggregate
create table dataTable as
  select cube(array[random(), random(), random()]) c,
         1::float as x1, 1::float as x2, 1::float as x3, 1::float as x4
  from generate_series(1, 3e6, 1) s;
create index idx on dataTable using gist(c);

-- 300K small boxes (regions) to join the points against
create table regions as
  select cube(array[x, y, z], array[x + 0.1, y + 0.01, z + 0.01]) c,
         row_number() over() as nx
  from (select random() x, random() y, random() z
        from generate_series(1, 3e5, 1) s) q;
create index idx1 on regions using gist(c);

-- keep the plans single-threaded
set max_parallel_workers_per_gather = 0;

-- containment join: sum the measures of the points inside each region
explain analyze select
    sum(a.x1) as x1, sum(a.x2) as x2, sum(a.x3) as x3, sum(a.x4) as x4
from dataTable a
join regions r on r.c @> a.c
group by r.nx;

-- overlap join: same aggregation with the && operator
explain analyze select
    sum(a.x1) as x1, sum(a.x2) as x2, sum(a.x3) as x3, sum(a.x4) as x4
from dataTable a
join regions r on r.c && a.c
group by r.nx;

commit;