Hi Maxence,

This is really, really good. Hopefully we will have a fruitful summer and make MADlib a better product.
Thanks
Hai

--
*Pivotal <http://www.gopivotal.com/>*
A new platform for a new era

On Sat, Apr 5, 2014 at 7:14 AM, Maxence Ahlouche <maxence.ahlou...@gmail.com> wrote:

> Hi all,
>
> 2014-04-03 9:04 GMT+02:00 Hai Qian <hq...@gopivotal.com>:
>
>> Hi Maxence,
>>
>> We are very glad that you are enthusiastic about MADlib. Your proposal
>> looks very nice. The algorithms that you proposed will be very useful,
>> and can serve as a good start. After learning about the C++ abstraction
>> layer and the testing suite, you should be able to quickly grasp how to
>> program for MADlib. We can fill in the details of the plan later. The
>> conflict in your time schedule shouldn't be a problem.
>>
>> It would be nice if you could provide a little bit more introduction to
>> these algorithms and their importance in the proposal.
>>
>> Overall it is very good. And thank you again.
>>
>> Hai
>
> As you requested, I've described the clustering algorithms on my GitHub
> [0]. I'm copying it at the end of this email, just in case.
> In the case of OPTICS, I still need some time to grasp the details of how
> it works, so I've only included a basic description of it.
>
> Regards,
> Maxence
>
> [0] https://github.com/viodlen/gsoc_2014/blob/master/algorithm_details.rst
>
>
> Details about the different clustering algorithms
> =================================================
>
> This file describes the k-means algorithm, which is currently the only
> clustering algorithm implemented in MADlib, and the k-medoids and OPTICS
> algorithms, which I plan to implement this summer as a GSoC project.
> I'll also explain the DBSCAN algorithm, as OPTICS is an extension of it.
>
> The goal of all these algorithms is to determine clusters in a set of
> points. While this problem is NP-hard, these algorithms solve it
> efficiently, even though the result is usually not perfect.
>
> I've tried not to add too many details, and taken some shortcuts, so
> there may be some inaccuracies. I wanted to keep this readable (not sure
> that goal was achieved, though), but if anyone wants more details, I'll
> gladly add them.
>
> k-means
> -------
>
> The k-means algorithm is the one already implemented in MADlib. It
> selects *k* points to use as initial centroids (centers of clusters),
> either randomly among the dataset or as set by the user. *k* must be
> defined by the user; the algorithm is unable to guess the number of
> clusters.
>
> All the points in the dataset are then assigned to the nearest centroid,
> thus making *k* clusters.
>
> The next step is to compute the new centroids as the "mean" of all the
> points in a cluster, then reassign all the points to the new centroids,
> and start all over again until there is no change in cluster assignment.
>
> This algorithm is usually pretty fast (even though it can be made to
> take exponential time to converge). Still, it is easy to get a wrong
> result because of local convergence, e.g. having one cluster split in
> two parts by the algorithm, or two clusters merged. It is also pretty
> sensitive to outliers (points that don't obviously belong to any
> cluster), and the final result depends greatly on the initial centroids.
>
> In general, this algorithm doesn't work well with non-Euclidean
> distances.
>
> Another thing to note is that k-means produces Voronoi-cell-shaped
> clusters, which is not always what we want.
>
> k-medoids
> ---------
>
> The k-medoids algorithm works the same way as k-means, save for a few
> exceptions.
>
> With k-medoids, the centroids are always points of the dataset (and are
> then called medoids). The new medoid of a cluster is the point in the
> cluster that minimizes the sum of pairwise dissimilarities (in other
> words, the point for which the average distance to the other points in
> the cluster is minimal).
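To make the shared assign/update loop concrete, here is a rough pure-Python sketch of k-medoids with the medoid update just described. This is purely illustrative: the function names are made up and this has nothing to do with MADlib's actual (SQL/C++) implementation, which would run these steps in-database.

```python
import random

def assign(points, medoids, dist):
    """Assign each point to the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda i: dist(p, medoids[i]))
            for p in points]

def update_medoids(points, labels, k, dist):
    """For each cluster, pick the member that minimizes the sum of
    distances to the other members -- that member is the new medoid."""
    medoids = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        medoids.append(min(members,
                           key=lambda m: sum(dist(m, q) for q in members)))
    return medoids

def k_medoids(points, k, dist, seed=0):
    """Alternate assignment and medoid update until assignments stop
    changing (k-means works the same way, with a mean instead of a
    medoid in the update step)."""
    random.seed(seed)
    medoids = random.sample(points, k)   # random initial medoids
    labels = assign(points, medoids, dist)
    while True:
        medoids = update_medoids(points, labels, k, dist)
        new_labels = assign(points, medoids, dist)
        if new_labels == labels:         # no change: converged
            return medoids, labels
        labels = new_labels

# Two obvious clusters on a line:
pts = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
d = lambda a, b: abs(a[0] - b[0])
medoids, labels = k_medoids(pts, 2, d)
```

Note that because the medoid update only ever evaluates `dist`, this loop works with any dissimilarity function, which is exactly why k-medoids copes with non-Euclidean distances better than k-means.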
> This makes the algorithm less sensitive to outliers than k-means.
>
> This algorithm is computationally more intensive than k-means, but
> still fast.
>
> As with k-means, it is possible to run the algorithm several times with
> different initial medoids and keep the best result (i.e. the one that
> minimizes the sum of distances from the points to their medoids).
>
> DBSCAN
> ------
>
> DBSCAN (*Density-Based Spatial Clustering of Applications with Noise*)
> is a clustering algorithm based on the density of clusters.
>
> It addresses several limitations of k-means and k-medoids: it does not
> assign a cluster to outliers, and it allows the detection of
> arbitrarily shaped clusters. Moreover, it doesn't need to be told the
> number of clusters.
>
> A point is called dense if enough other points are close enough to it,
> where "enough other points" and "close enough" are defined by the
> parameters *min_points* and *epsilon*.
>
> *min_points* therefore represents the minimum number of points required
> to make a cluster, and *epsilon* is the maximum distance a point can be
> from a cluster to be considered part of that cluster.
>
> For every unassigned point, count its *epsilon*-neighbours (not sure
> this term exists, but I'll use it for convenience). If there are too
> few, consider this point an outlier; otherwise, create a new cluster
> and put the current point in it, along with its neighbours, the
> neighbours of its neighbours, and so on.
>
> The main problem with DBSCAN is that it doesn't work well for clusters
> with very different densities.
>
> OPTICS
> ------
>
> OPTICS (*Ordering Points To Identify the Clustering Structure*)
> addresses this issue.
>
> The first step in OPTICS is to order the points so that two points that
> are near each other end up in nearby positions, and to store the
> smallest reachability distance of each point (i.e. the distance at
> which the point would be considered dense by DBSCAN).
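Going back to DBSCAN for a moment, the neighbour-expansion loop described above can be sketched in a few lines of brute-force Python. Again this is only an illustration with made-up names, not MADlib code; a real implementation would use a spatial index instead of scanning all points for every neighbourhood query.

```python
def region_query(points, p, eps, dist):
    """Indices of the epsilon-neighbours of point index p (self included)."""
    return [i for i, q in enumerate(points) if dist(points[p], q) <= eps]

def dbscan(points, eps, min_points, dist):
    """Return one label per point: a cluster id, or -1 for outliers."""
    labels = [None] * len(points)
    cluster = 0
    for p in range(len(points)):
        if labels[p] is not None:        # already assigned or flagged
            continue
        neighbours = region_query(points, p, eps, dist)
        if len(neighbours) < min_points:
            labels[p] = -1               # too few neighbours: outlier, for now
            continue
        labels[p] = cluster              # p is dense: start a new cluster
        seeds = list(neighbours)
        while seeds:                     # expand through dense neighbours
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster      # former outlier joins as border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neigh = region_query(points, q, eps, dist)
            if len(q_neigh) >= min_points:   # q is dense too: keep expanding
                seeds.extend(q_neigh)
        cluster += 1
    return labels

# Two line clusters plus one outlier:
pts = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (50.0,)]
d = lambda a, b: abs(a[0] - b[0])
labels = dbscan(pts, eps=1.5, min_points=2, dist=d)
```

The single global *epsilon* in this sketch is exactly the weakness mentioned above: one threshold cannot fit clusters of very different densities, which is what motivates OPTICS.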
>
> The next step is to determine the clusters from the data obtained in
> the previous step. This can be done visually: by plotting the ordered
> points on *x* and their smallest reachability distance on *y*, the
> clusters appear as the valleys formed by the curve.
>
> I'll add more details to this section when I understand better how all
> of this works.
>
> Bibliography
> ------------
>
> http://en.wikipedia.org : where clustering algorithms are well detailed
>
> http://www.stat.cmu.edu/~ryantibs/datamining/lectures/04-clus1-marked.pdf
> : introduction to k-means and k-medoids
>
> http://www.vitavonni.de/blog/201211/2012110201-dbscan-and-optics-clustering.html
> : short introduction to DBSCAN and OPTICS
>
> http://www.cs.uiuc.edu/class/fa05/cs591han/papers/ankerst99.pdf :
> original paper presenting OPTICS
>
> --
> Maxence Ahlouche
> 06 06 66 97 00