RE: MI-L Eliminating duplicate Polygons

Quantitative Decisions Tue, 12 Mar 2002 15:10:36 -0800

At 07:28 AM 3/12/02 -0500, Jacques Paris wrote:
>In comparing 2 regions, one must consider two kinds of data related to shape
>and to position. <snip--rest of message below>


This problem is encountered in other GISes, too (e.g., 
ArcView).  Everything Paris says is correct, useful, and insightful (as 
always!), but the problem is still not that difficult.  One can glean a lot 
of information even from polygon centers, areas, and perimeters.

The idea is to spend most of your time computing easy-to-calculate 
properties.  If a polygon's MBR is the fastest thing to obtain, go for 
it.  Otherwise look at the MBR center, the polygon area, the centroid, even 
the vertex count or the coordinates of the first few vertices--it depends 
on how the GIS implements its polygons.  Use these as an initial, 
potentially redundant, key to check for matches.  After all, Mr. Paris' 
objections notwithstanding, if two polygons have two different centroids, 
then the polygons are different, and you can move on.  If you are in a 
situation where you don't expect very many duplicates, and you rarely have 
polygons that overlap, then using centroids can be an efficient discriminator.

In such a fashion you can winnow down the number of direct 
polygon-to-polygon comparisons you have to make.  Ultimately, you can try 
subtracting one polygon from another and computing the area--it's a good 
method--but when the underlying coordinates are represented by INTs (as in 
MapInfo), you will run into the problem that quite extensive polygon 
slivers can have areas that are computed as zero.  (This seems to crop up 
even using floating-point coordinates, too, probably because there are bugs 
in almost every computational geometry program extant.)  If, however, the 
polygons are well and truly identical in every respect, and the subtraction 
algorithm is a robust and correct one, then the resulting shape will be 
null and have a null MBR, so you can test for that.

In summary, here's the algorithm for finding duplicates in any collection C 
of objects (such as polygons) that are expensive to compare individually 
but for which inexpensive keys can be computed:

1.      Create an associative array A that will assign collections of 
objects to keys.
2.      For each P in C:
2a.             Compute a "cheap" key K based on P.
2b.             Add P to A[K].
         End
3.      For each key K in A[]:
3a.             Until A[K] is empty Do:
3b.                     Let P be any element in A[K]
3c.                     Compare P to the remaining elements of A[K], if 
any, using an exact method, and remove every object in A[K] that does not 
match P.
3d.                     Output P.
                 End
         End

Notes
-------
If memory is severely limited compared to the number of objects to process, 
make two passes through the dataset.  In the first, simply count the number 
of occurrences of each key that shows up.  Then retain in the associative 
array only those keys for which the count is two or greater.  The second 
pass proceeds much like the algorithm above, except (a) polygons with no 
matching key in A[] are output directly and (b) at step 2b, the expensive 
comparisons 3a-3d are performed immediately, to ensure that duplicates are 
never put into A.  In a post-processing step, all objects collected in A in 
the second pass are output.

If you don't have associative arrays, then your keys will have to be whole 
numbers (or maybe you can press your database into service: table summaries 
accomplish similar things; or even, implement a home-grown associative 
array with a hash table or something like it--it's easy to do).  There are 
lots of ways to convert things like centroids, etc., into whole numbers: 
use your imagination or research "hashing" in your favorite algorithms 
textbook.

If your polygon subtraction algorithm is unreliable, avoid using it in step 
3c by first comparing everything else you can think of: area, perimeter, 
centroid, number of vertices, coordinates of MBR, etc.  If all else fails, 
hash the sequence of vertex coordinates into the largest integer you can 
handle and play the odds: they're highly in your favor.

Cheers,
Bill Huber
www.quantdec.com

>Two regions may have identical shapes and not be
>coincidental in space. The shape elements are many: number of parts, number
>of nodes, area, perimeter. Positional elements are few in nature,
>essentially x,y coordinates of corresponding nodes.
>
>Notice I do not include the centroid because it is not a reliable element
>having a unique "value"  for a region; it is not computed but assigned and
>can be changed within limits. Even if one would "reset" all centroids before
>hand (converttopline + converttoregion), there is no guarantee that the
>assigned centroid would be different if there are some slight differences in
>an outlying polygon (a polygon not containing the centroid), because it
>could be constrained by the limits of the "central" polygon.
>
>It would be much easier if there could be only one criterion that would do
>the trick. When I explored that question two years ago, I tried to use a
>simple all encompassing formula to compare two regions oa and ob and I found
>that area(erase(oa,ob),"sq mi")=0 could be a good solution. However the
>comparison to 0 was not efficient because the mathematics of the operator
>had a tendency then to "leave some accidental crumbs". I am now thinking
>that the solution could be area(erase(oa,ob),"sq mi") =
>area(erase(ob,oa),"sq mi") but isn't that just hoping that the "crumbs" will
>be different?
>
>Jacques Paris



_______________________________________________________________________
List hosting provided by Directions Magazine | www.directionsmag.com |
To unsubscribe, send e-mail to [EMAIL PROTECTED] and
put "unsubscribe MapInfo-L" in the message body.

RE: MI-L Eliminating duplicate Polygons

Reply via email to