Hi Ryan, I have made the data set (273MB) available (saved with mlpack::data::Save() ) at
https://drive.google.com/open?id=0B_8-awG7IJzyT2xaU3RQZDhmYzQ This should be 542K points of dim=63. I ran DBSCAN with epsilon=1 and minPts=64. I have also attached an image (and left iy in the dir) which is a histogram of inter-point distances under the regular Euclidean metric, so epsilon =1.0 seems like it should not have returned that many points? Hopefully you can reproduce the error - please let me know if you need anything more. I checked for NaN and INF in the data max, and everything seems to be fine. Cheers, Patrick On Thu, Jun 8, 2017 at 8:25 PM, Ryan Curtin <[email protected]> wrote: > On Thu, Jun 08, 2017 at 05:58:07PM +0200, Patrick Marais wrote: > > HI Ryan, > > > > Thanks for the quick reply and suggestions. > > > > I suspect you're right about the neighbor query getting too many points; > I > > tried reducing it massively, and now it runs for about 15 minutes > > and the seg faults. See below - The number of points is a bit higher > than > > I stated: 542326, but the input is an arma::mat with this number of > columns > > and rows=63. I'm fairly sure I have checked everything to remove NaN's > and > > so on. Could it be possible that the size of the data set is causing > > something to fail? The memory usage was not maxed out at this point (only > > about 2.5GB over 8GB). > > > > I just ran it through gdb, which doesn't seem to have all the debug > > information, so besides the place at which it crashed, I can't say much > > else. > > > > Not sure what to try next. If I remove the cal to dbscan and use Kmeans, > > everything works (although the clusters are not what I'd really like). > > Hmm, not sure exactly what the issue is here. You may have uncovered a > bug. Is there any chance I can get the dataset to try and reproduce the > failure? > > Another sanity check would be to try an even smaller epsilon; if it's > still taking 15 minutes, then it still may be finding very many points > for each range search. > > Clustering is a hard problem, and there's definitely a big tradeoff > between "fast and bad" (k-means lives here) and "slow but good" (maybe > you could say this about DBSCAN, but even DBSCAN is faster than some > things like spectral clustering methods). > > -- > Ryan Curtin | "Bye-bye, goofy woman. I enjoyed repeatedly > [email protected] | throwing you to the ground." - Ben Jabituya >
_______________________________________________ mlpack mailing list [email protected] http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
