[jira] [Commented] (MAHOUT-1991) Implement naive DBSCAN Algorithm - O(n^2) complexity

ASF GitHub Bot (JIRA) Wed, 30 Aug 2017 08:06:28 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147397#comment-16147397
 ]


ASF GitHub Bot commented on MAHOUT-1991:
----------------------------------------

Github user rawkintrevo commented on a diff in the pull request:

    https://github.com/apache/mahout/pull/334#discussion_r136095747
  
    --- Diff: website/docs/algorithms/clustering/dbscan/index.md ---
    @@ -0,0 +1,101 @@
    +---
    +layout: algorithm
    +title: DBSCAN
    +theme:
    +   name: retro-mahout
    +---
    +
    +### About
    +
    +[DBSCAN Clustering](https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf)
    + is a simple and efficient density based clustering algorithm for group 
objects
    + into clusters. DBSCAN differs from other clustering algorithms in the way 
it
    + clusters objects. DBSCAN uses the density information of points to define 
clusters
    + Each object is represented as a point in the multidimensional feature 
space. The algorithm
    + uses the notion of core points, border points and noise points to perform 
the clustering.
    +
    + Parameters of the Algorithm: (Epsilon, Min points)
    + * Epsilon (Eps) - denotes the radius of the sphere that defines the 
neighbourhood of a given data point
    + * Min points (Minpts) - the minimum threshold for the number of points in 
a point p's Eps-neighbourhood to
    + make p a core point
    + 
    + Basic Definitions
    + * Eps-neighbourhood of a point - The Eps-neighbourhood of a point 'p' is 
defined as the set of points S
    +   such that for each x 'belongs to' S, distance(p,x) <= eps
    + * Core point - A point 'p' is called a core point if the number of points 
in its Eps-neighbourhood is >= Minpts
    + * Border point - A point that is not a core point but such that there 
exists a 'p' in its neighbourhood
    +   such that p is a core point
    + * Noise point - Any point that does not classify as a core point or 
border point is a noise point. Noise points
    +   do not belong to any cluster.
    +
    + Algorithm:
    + The DBSCAN algorithm takes as parameters Eps and MinPts. Eps restricts 
the neighbourhood of a point and MinPts
    + denotes the threshold for the number of neighbours in Eps-neighbours of 
point p, for it to form a cluster.
    + The algorithm randomly chooses a point q, and finds it Eps-neighbourhood. 
If the number of points in the
    + Eps-neighbourhood of q is less than MinPts, it is marked as a noise 
point. Otherwise it is marked as a core point
    + and a new cluster is created. If p is a core point, iteratively a 
Eps-neighbourhood query is performed on each of
    + its neighbours and points are added to the created cluster.  If no 
unvisited points can be added to cluster, the new
    + cluster is complete and no points will be added to the cluster in 
subsequent iterations. Then another unvisited point q'
    + is picked up and the same process is repeated. The algorithm terminates 
when all the points have been visited
    + i.e. when some of the points are added to clusters and some are marked as 
noise points. The total number of
    + Eps-neighbourhood queries performed is equal to the size of the data (n) 
and if no indexing data structure is
    + used then the calculation of Eps-neighbourhood query involved the 
distance calculation with all the points (n).
    + Thus, if no indexing data structure is used, the complexity of the DBSCAN 
algorithm is O(n^2)
    +
    +#### Strategy for parallelization
    +
    +The DBSCAN algorithm is an inherently sequential algorithm. A quick look 
at the pseudo code provided in the original
    +paper will show why.
    +
    +The parallelization strategy for DBSCAN involves 3 phases.
    +* Data distribution phase preserving spatial locality
    +* InCore DBSCAN Clustering
    +* Merging InCore clusters to form Global clusters
    +
    +Few methods to parallelize DBSCAN have been proposed in the recent past 
and can be found 
[here](http://delivery.acm.org/10.1145/2390000/2389081/a62-patwary.pdf?ip=14.139.128.15&id=2389081&acc=ACTIVE%20SERVICE&key=045416EF4DDA69D9%2EDB7584019D0D7099%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=973347130&CFTOKEN=86904048&__acm__=1503663674_474f7b3eaa352d59c2905c1f530907c1)
 and 
[here](http://delivery.acm.org/10.1145/2840000/2834894/a2-gotz.pdf?ip=14.139.128.15&id=2834894&acc=ACTIVE%20SERVICE&key=045416EF4DDA69D9%2EDB7584019D0D7099%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=973347130&CFTOKEN=86904048&__acm__=1503663710_9001c4bdf3d4dc3ed6658d8a6aaf1190)
    +
    +In order to minimize the communication costs for large datasets, we 
provide an approximate DBSCAN algorithm. Data division among nodes is done 
randomly
    +violating the spatial locality condition. Clustering is performed InCore 
and cluster merging is done by merging two clusters
    +that are nearby and that contain significant number of points.
    +
    +### Parameters
    +<div class="table-striped">
    +  <table class="table">
    +    <tr>
    +        <th>Parameter</th>
    +        <th>Description</th>
    +        <th>Default Value</th>
    +    </tr>
    +    <tr>
    +        <td><code>'distanceMeasure</code></td>
    +        <td>The metric used for calculating distance, see <a 
href="../distance-metrics.html">Distance Metrics</a></td>
    +        <td><code>'Euclidean</code></td>
    +    </tr>
    +    <tr>
    +        <td><code>'Eps</code></td>
    +        <td>The radius that defines the neighbourhood of a 
point</code></td>
    +        <td>NA</td>
    +    </tr>
    +    <tr>
    +        <td><code>'Minpts</code></td>
    +        <td>The threshold for number of points in the neighbourhood of 
another point</code></td>
    +        <td>NA</td>
    +    </tr>
    +    </tr>
    +  </table>
    +</div>
    +
    +
    +### Example
    +
    +    //Using the InCoreDBSCAN module
    +    val input = dense((1.0, 1.2, 1.3, 1.4), (1.1, 1.5, 2.5, 1.0), (6.0, 
5.2, -5.2, 5.3), (7.0,6.0, 5.0, 5.0), (10.0, 1.0, 20.0, -10.0))
    +
    +    import org.apache.mahout.math.algorithms.clustering.InCoreDBSCAN
    +    val distanceMetric = 
DistanceMetricSelector.select(DistanceMetricSelector.namedMetricLookup('Euclidean)))
    --- End diff --
    
    +1 utilizing existing distanceMetrics framework.


> Implement naive DBSCAN Algorithm - O(n^2) complexity
> ----------------------------------------------------
>
>                 Key: MAHOUT-1991
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1991
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Algorithms
>            Reporter: Aditya AS
>            Assignee: Aditya AS
>
> Implement the naive DBSCAN algorithm in Mahout Samsara, as part of the 
> Algorithms Framework.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (MAHOUT-1991) Implement naive DBSCAN Algorithm - O(n^2) complexity

Reply via email to