reductionista opened a new pull request #571: URL: https://github.com/apache/madlib/pull/571
This is a combination of over 92 commits that have been squashed, and while it works and the performance is great, it's still a "work in progress" in that, we need to add tests for it, clean up some things, and make sure the brute force option still functions properly, along with any other special cases (such as different names for id and point columns, etc.). Also todo: clean up and consolidate commit messages, which for now have just been aggregated and pruned a bit. Opening now so that a PR review can begin for this. This is a re-write / 2nd attempt at improving over the original brute force implementation. This seems to be working now, and should have roughly O(N/S log N) runtimes, where N is the size of the input dataset (number of points/rows) and S is the number of segments used (note: the algorithm now decides on its own how many segments to use, as sometimes splitting things up further will only increase the runtime. Therefore S is not necessarily the total number of segments in the cluster, it may be less depending on the structure of the dataset.) This borrows many aspects from the previous attempt (using an overlapping spatial binary tree to segment the dataset and using an R tree index to speed up range queries), but differs in its approach in other ways. === Rewording of previous summary of the improvements The brute force DBSCAN runs on N^2 time. To improve this, we split the data into different overlapping regions, running DBSCAN on each in parallel, then merging the results together on the coordinator. More specifically: 1. The data is segmented into connected spatial regions, with a binary tree optimized specifically for DBSCAN. 2. Each leaf of the optimized spatial tree runs the dbscan algorithm on the points in its spatial region, including some points from other regions near its boundaries, using an R tree index for efficient range queries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@madlib.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org