On Thu, Jun 08, 2017 at 03:17:17PM +0200, Patrick Marais wrote: > Hello, > > I've just started using mlpack, and wished to cluster 512,000 points, in a > 63-dim feature space. Kmeans does this in < 15 sec on my laptop. I tried > the same with DBSCAN and killed it at about 700 sec while it was still (I > think) churning away (and consuming a lot of memory). I thought the use of > a Rangesearch kd-tree should have reduced this to something fairly quick? I > copied example code from the command line version of dbsacn, since there is > no tutorial. My understanding is that building a default RangeSearch should > give you a kd-tree? That seems to be what the code shows too. > > I was wondering if my choice of epsilon/minPts could have such a dramatic > effect. I histogrammed the distance between all point pairs and set epsilon > based on this, minpts I set to 64, i.e.dim+1 as per advice I had seen > elsewhere. > > So, I guess my questions are > 1) Is the DBSCAN somehow using a naive/brute force search? > 2) is DBSCAN intrinsically slow for lage numbers of high dimensional > points? I hear of people clustering 10's of millions if points, so I didn't > think my data set was unreasonably large. > > I'd be very disappointed if (2) was indeed the case. > > Any advice/comments appreciated. I've included he code snippet below.
Hi Patrick, You are right that DBSCAN uses kd-tree-based range search as a default. In fact you can simplify your code a little bit: mlpack::dbscan::DBSCAN<> dbs(epsilon, minclusterpts); The problem that you are encountering is intrinsic to DBSCAN but it is solvable. When I perform a range search with some radius, for DBSCAN I must collect the indices of every point that lies within that range. Thus for 512k points, supposing that I choose a large enough epsilon, then the range search will return every single other point in the dataset, and the RAM usage to do that will be non-negligible. Therefore, reducing epsilon significantly can go a long way towards reducing the computational cost of DBSCAN. I would try reducing epsilon by an order of magnitude or more and seeing if this improves your runtime while still giving acceptable clustering results. If epsilon can't be reduced to a small enough value that gives good clustering results in reasonable time, then it's possible that DBSCAN is not a good fit for the dataset you are clustering. I hope this helps; let me know if I can clarify anything else. Thanks, Ryan -- Ryan Curtin | "...wildcat... ...wild... cat... ... ...pow... [email protected] | ...wildcat... I'm going to go." - Eli Cash _______________________________________________ mlpack mailing list [email protected] http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
