On Fri, Jun 09, 2017 at 01:07:27PM +0200, Patrick Marais wrote:
> Hi Ryan,
>
> I have made the data set (273MB) available (saved with
> mlpack::data::Save()) at
>
> https://drive.google.com/open?id=0B_8-awG7IJzyT2xaU3RQZDhmYzQ
>
> This should be 542K points of dim=63.
>
> I ran DBSCAN with epsilon=1 and minPts=64.
>
> I have also attached an image (and left it in the dir) which is a
> histogram of inter-point distances under the regular Euclidean metric,
> so epsilon=1.0 seems like it should not have returned that many points?
>
> Hopefully you can reproduce the error - please let me know if you need
> anything more.
>
> I checked for NaN and INF in the data matrix, and everything seems to
> be fine.
Hi Patrick,

Thanks for the dataset. I found multiple issues, and I've solved each of
them.

First, some code needed to be updated, but that part was easy.

Second, the step that turned the range search results into cluster
assignments for each point was inefficient, and it could overflow the
stack -- which is what happened with your dataset for large epsilon
values. That step can be done much better with a union-find structure.
Happily, mlpack already has one implemented, so I just plugged that in
and got a significant speedup. Now the vast majority of the runtime on
your dataset is the range search itself.

Third, I found that with larger epsilon, the system does indeed run out
of RAM, because each point has too many neighbors in its range search
result set; i.e., if each point has 1k neighbors and there are 500k
points, then we need an array of size 500M to hold them all... I thought
about it briefly and realized that with the union-find structure, we do
not need to hold the range search results for every single point at
once: instead, we can perform an individual range search for each point
in the dataset. That might be slower (depending on the dataset and lots
of other things), but it will in all cases use less RAM. The
--single_mode option can be used for that mode.

So, for instance, now I can run like this:

$ bin/mlpack_dbscan -i ~/Downloads/big-data-rows_63_cols_542K.bin -e 0.9 -v --single_mode -C c.csv
[INFO ] Loading '/home/ryan/Downloads/big-data-rows_63_cols_542K.bin' as Armadillo binary formatted data.  Size is 63 x 542329.
[INFO ] DBSCAN clustering on point 10000...
<...skipped...>
[INFO ] DBSCAN clustering on point 540000...
[INFO ] 1154 clusters found.
[INFO ] Saving CSV data to 'c.csv'.
[INFO ]
[INFO ] Execution parameters:
[INFO ]   assignments_file: ""
[INFO ]   centroids_file: c.csv
[INFO ]   epsilon: 0.9
[INFO ]   help: 0
[INFO ]   info: ""
[INFO ]   input_file: /home/ryan/Downloads/big-data-rows_63_cols_542K.bin
[INFO ]   min_size: 5
[INFO ]   naive: 0
[INFO ]   single_mode: 1
[INFO ]   tree_type: kd
[INFO ]   verbose: 1
[INFO ]   version: 0
[INFO ] Program timers:
[INFO ]   loading_data: 0.324149s
[INFO ]   range_search/computing_neighbors: 511.964841s (8 mins, 31.9 secs)
[INFO ]   saving_data: 0.049956s
[INFO ]   total_time: 522.897931s (8 mins, 42.8 secs)

For your dataset it seems like --single_mode is faster, but that's not
necessarily true for all data (especially as the dimensionality gets
lower).

To use the new code, you can clone from the git master branch at
https://github.com/mlpack/mlpack.git and make the 'mlpack_dbscan'
target. The updated code will be released either soon-ish as part of an
mlpack 2.2.4 backport release, or as part of mlpack 3.0.0 (though the
3.0.0 release may still be some months off, I'm not sure).

Let me know if I can clarify anything. Thanks for the report! I think
mlpack's DBSCAN implementation is significantly improved as a result.

Ryan

--
Ryan Curtin      | "I am."
[email protected] |   - Joe

_______________________________________________
mlpack mailing list
[email protected]
http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack

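P.P.S. For anyone following along, the "clone and make the target" step
is mlpack's standard out-of-source CMake build; a sketch (directory
names are arbitrary, and the final invocation just repeats the example
from the mail):

```shell
git clone https://github.com/mlpack/mlpack.git
cd mlpack
mkdir build && cd build
cmake ..
make mlpack_dbscan
# Then run, e.g.:
# bin/mlpack_dbscan -i data.bin -e 0.9 --single_mode -v -C c.csv
```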