Sorry, I forgot to also include the SQL for the 2 temporary tables used by the cluster_points_buffer.sql script.
create table clusters1 (id serial, geom geometry);
create index clusters1_geom_gist on clusters1 using gist (geom);

create table clusters2 (id serial, geom geometry);
create index clusters2_geom_gist on clusters2 using gist (geom);

Sébastien

On Fri, Jan 7, 2011 at 15:54, Sébastien Lorion <s...@thestrangefactory.com> wrote:
> Happy New Year everyone!
>
> I found time to test out the st_buffer+st_intersects solution (which I
> will name "buffer") and the st_dwithin+WITH RECURSIVE solution
> ("recursive"). I also came up with another one based on nested cursor
> FOR loops and st_dwithin ("cursor"). I have not tested the k-means
> solution yet.
>
> To quickly recap the problem at hand: I need to determine all the
> clusters of points that are within range of each other. For example,
> say the points represent vehicles with a radar: if one can be seen by
> the other, then together they form a cluster. Over time, the points
> move, so I need to be able to update the results with only the points
> that moved. My data set size is about 100M+ points. One thing to note
> is that I do not expect to process 100M points at once; rather, it
> will be a gradual process with points added over time. There may be
> sharp peaks of changes though, so speed is still very important.
>
> Briefly, the running time of all 3 solutions *seems* to grow
> exponentially with the average number of points inside a cluster, and
> memory does not seem to be a bottleneck for any of them. The
> "recursive" solution has a very steep curve, while the "buffer"
> solution scales better but will also choke on semi-large data sets
> (1M points or more). The "cursor" solution fares much better and
> seems able to handle any data set size in a reasonable time (1M
> points in about 2h30 on my Lenovo T61p with an SSD). Once the cloud
> cluster is fully in place, I will be able to test with larger data
> sets.
>
> Some limitations of the "buffer" solution: it requires additional
> processing to get the points that are part of each cluster afterward;
> it is rather difficult to update only a subset of the points (only
> those that moved); and I did not come up with a way to handle points
> with different ranges (though I did not spend a lot of time thinking
> about it either).
>
> I made only a small tweak to the "recursive" solution, so that
> isolated points are quickly discarded.
>
> For the "buffer" solution, I first create a buffer (ST_Buffer) around
> each point with a radius equal to half the unit range and then
> iteratively merge (ST_Union) all intersecting buffers (ST_Intersects).
> This requires 2 additional tables containing temporary result
> geometries. First the buffers are created in table1, then each
> iteration puts the result of the merging operation in the other,
> unused table (truncating it first), i.e. points->table1,
> table1->table2, table2->table1, and so on. The function returns the
> name of the table where the final result is stored.
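>
> To make the merge step concrete, here is a rough sketch of how the
> buffers could be built and then dissolved in a single pass instead of
> iteratively. It is untested as written; "pts" and its "range" column
> are placeholder names, and it assumes a uniform range across points:
>
> -- Buffers of half the unit range: two buffers intersect exactly when
> -- their points are within range of each other (uniform range assumed).
> insert into clusters1 (geom)
> select st_buffer(geom, range / 2.0) from pts;
>
> -- Single-pass variant of the merge: ST_Union dissolves every group of
> -- overlapping buffers into one polygon, and ST_Dump splits the result
> -- back into one row per connected component, i.e. per cluster.
> truncate clusters2;
> insert into clusters2 (geom)
> select (st_dump(st_union(geom))).geom from clusters1;
>
> In this single-pass form no table swapping is needed, though I have
> not compared its performance with the iterative merge.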
>
> As for the "cursor" solution, here is the algorithm (a rough
> PL/pgSQL rendering is sketched at the end of this message):
>
> for each ptA in points
>     for each ptB in reachable points from ptA
>         if ptA has no group and ptB has no group
>             ptA.grp = ptB.grp = new_id()
>         else if ptA has no group and ptB has a group
>             ptA.grp = ptB.grp
>         else if ptA has a group and ptB has no group
>             ptB.grp = ptA.grp
>         else if ptA.grp != ptB.grp
>             update all points with a group equal to ptB.grp with ptA.grp
>
> One might further optimize it by first dividing the points into a
> grid and, in the inner loop, processing only the points that are in
> cells adjacent to the cell of the point of origin. The cell size
> should be chosen so that reachable points are never 2 cells away
> (i.e. they are always in immediately adjacent cells), to ease and
> speed up the processing.
>
> I am currently working on a Scala version, which will allow me to add
> multi-threading and offload the processing to multiple nodes instead
> of being restricted to the DB server itself, as I would be with
> PL/pgSQL. With the grid optimization, it is already about an order of
> magnitude faster.
>
> I have attached the code of the 3 functions I used for testing. You
> will also need to create a sequence named "seq_pts_clusterId". I also
> included a data set with 100k points if you want to play around with
> the functions. Update the "range" column to form bigger clusters.
>
> You will notice that the "cursor" code (dwithin2) has some
> commented-out code. It is there because I tried to update the data
> using the current position of the cursors, instead of doing it via a
> standard WHERE query, but I had data refresh problems, i.e. the outer
> loop cursor was still seeing old data when it reached a record that
> had previously been updated by the inner loop. I am not sure why; I
> tried a couple of different syntaxes, but to no avail. If someone has
> an idea, feel free to share it ...
>
> Sébastien
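
PS: for anyone who wants to experiment without digging through the
attachment, below is a minimal PL/pgSQL sketch of the "cursor" algorithm
quoted above. It is untested as written; the sequence name
"seq_pts_clusterId" and the "range" column come from the message, while
the table name "pts" and the columns "id", "geom" and "grp" are
assumptions on my part.

create or replace function cluster_points_cursor_sketch() returns void as $$
declare
    ptA record;
    ptB record;
    grpA bigint;
begin
    for ptA in select id, geom, range from pts loop
        -- Re-read the group id: an earlier outer iteration may have
        -- updated this point since the outer cursor was opened.
        select grp into grpA from pts where id = ptA.id;

        for ptB in
            select id, grp from pts
            where id <> ptA.id
              and st_dwithin(geom, ptA.geom, ptA.range)
        loop
            if grpA is null and ptB.grp is null then
                -- Neither point is grouped yet: start a new cluster.
                grpA := nextval('seq_pts_clusterId');
                update pts set grp = grpA where id in (ptA.id, ptB.id);
            elsif grpA is null then
                -- Join ptA to ptB's existing cluster.
                grpA := ptB.grp;
                update pts set grp = grpA where id = ptA.id;
            elsif ptB.grp is null then
                -- Join ptB to ptA's existing cluster.
                update pts set grp = grpA where id = ptB.id;
            elsif grpA <> ptB.grp then
                -- Two distinct clusters touch: merge ptB's into ptA's.
                update pts set grp = grpA where grp = ptB.grp;
            end if;
        end loop;
    end loop;
end;
$$ language plpgsql;

Calling select cluster_points_cursor_sketch(); processes the whole table
in one pass; per-point ranges are handled naturally since the inner query
uses ptA.range. A stale ptB.grp read from the inner cursor only leads to
a harmless no-op update, since merges always relabel toward grpA.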