Hello, I've just started using mlpack, and wished to cluster 512,000 points, in a 63-dim feature space. Kmeans does this in < 15 sec on my laptop. I tried the same with DBSCAN and killed it at about 700 sec while it was still (I think) churning away (and consuming a lot of memory). I thought the use of a Rangesearch kd-tree should have reduced this to something fairly quick? I copied example code from the command line version of dbsacn, since there is no tutorial. My understanding is that building a default RangeSearch should give you a kd-tree? That seems to be what the code shows too.
I was wondering if my choice of epsilon/minPts could have such a dramatic effect. I histogrammed the distance between all point pairs and set epsilon based on this, minpts I set to 64, i.e.dim+1 as per advice I had seen elsewhere. So, I guess my questions are 1) Is the DBSCAN somehow using a naive/brute force search? 2) is DBSCAN intrinsically slow for lage numbers of high dimensional points? I hear of people clustering 10's of millions if points, so I didn't think my data set was unreasonably large. I'd be very disappointed if (2) was indeed the case. Any advice/comments appreciated. I've included he code snippet below. Thanks, Patrick std::cerr << "Starting DBSCAN clustering for " << trainingData.n_cols << " samples...\n"; mlpack::range::RangeSearch<> RS; // default KDtree range search acceleration // if minPts == 0, then compute actual minPts used as Dim+1, where dim is the dimensio of the feature vector int minclusterpts = (minClustSize < 1 ? trainingData.n_rows+1 : minClustSize); mlpack::dbscan::DBSCAN<mlpack::range::RangeSearch<> > dbs(epsilon, minclusterpts, RS); // do actual clustering int numCluster = (int)dbs.Cluster(trainingData, assignments);
_______________________________________________ mlpack mailing list [email protected] http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
