Hi Alexandre,
> I have a few questions on your experiment though:
> - how many clusters do you have (as the block method's speed and memory
> consumption depend on the number of clusters)
>
The dataset is clustered into 50 clusters.
> - have you monitored memory usage ? In particular, did you swap at any
> moment ? Because swapping is a time killer.
>
I have not monitored memory usage. However, the computation time reported
here is CPU time, not elapsed (wall-clock) time.
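To make the distinction concrete, here is a minimal sketch using only the Python standard library. The helper name `timed` is hypothetical and this is not the timing code from my experiment. It shows that time spent waiting (sleeping here, but I assume waiting on swap would mostly behave the same way) is counted in elapsed time and hardly at all in CPU time.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, cpu_seconds, elapsed_seconds)."""
    cpu_start = time.process_time()    # CPU time used by this process
    wall_start = time.perf_counter()   # elapsed (wall-clock) time
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return result, cpu, elapsed


# Sleeping uses almost no CPU, so the two figures differ clearly.
_, cpu, elapsed = timed(time.sleep, 0.2)
print(f"cpu={cpu:.3f}s elapsed={elapsed:.3f}s")
```

So a CPU-time figure alone probably cannot rule out swapping; reporting both numbers would be more informative.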
> - have you some results using the scikit learn function (and using
> sampling to make data fit into memory) ?
>
I can only run the original scikit-learn silhouette score when the data
size is below 40K samples (or with sub_sample < 40K). At 50K samples it
runs out of memory, so I have not run the original implementation on my
whole dataset. I will re-check and let you know soon.
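For reference, this is roughly how the sub-sampled run looks with the stock scikit-learn function. It is a minimal sketch: the data is a synthetic stand-in from `make_blobs` (an assumption, not my dataset), and the sizes are chosen only to keep it quick. `sample_size` and `random_state` are the parameters of `silhouette_score` that control the sub-sampling.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 5000 points in 50 groups. The generated labels play
# the role of a clustering result here.
X, labels = make_blobs(n_samples=5000, centers=50, n_features=10,
                       random_state=0)

# The score is computed on a random subset of sample_size points, so the
# pairwise distances only involve that subset instead of the full dataset.
score = silhouette_score(X, labels, sample_size=2000, random_state=0)
print(score)
```

Because the subset is random, the value is an estimate and changes with `random_state`; repeating it with a few seeds gives an idea of the spread.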
> The big advantage of the block version is that it can easily be
> parallelized so if your memory is not full, we can still speed up
> computation !
>
>
I am not sure I understand this. Do you mean that we can process several
blocks of the same sub_sample size at the same time, and save time that
way?
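To check my understanding, here is a minimal sketch of what I imagine. It is my own guess at a block-wise computation, not your actual implementation: the names `silhouette_blocks` and `_block_values`, the Euclidean metric, and the use of threads are all assumptions. Each block only needs a (block_size x n_samples) distance matrix, and the blocks are independent, so several can be handled at once.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def _block_values(X, codes, onehot, counts, start, stop):
    """Silhouette values for rows start:stop (hypothetical helper)."""
    block = X[start:stop]
    rows = np.arange(stop - start)
    # Euclidean distances from the block rows to all samples.
    sq = ((block ** 2).sum(axis=1)[:, None]
          + (X ** 2).sum(axis=1)[None, :]
          - 2.0 * block @ X.T)
    dist = np.sqrt(np.maximum(sq, 0.0))
    dist[rows, np.arange(start, stop)] = 0.0  # exact zero self-distance
    sums = dist @ onehot                      # distance sums per cluster
    own = codes[start:stop]
    n_own = counts[own]
    # a: mean distance to the other members of the sample's own cluster.
    a = sums[rows, own] / np.maximum(n_own - 1, 1)
    # b: smallest mean distance to any other cluster.
    means = sums / counts
    means[rows, own] = np.inf
    b = means.min(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        s = (b - a) / np.maximum(a, b)
    s[n_own == 1] = 0.0                       # singleton clusters score 0
    return np.nan_to_num(s)


def silhouette_blocks(X, labels, block_size=1000, n_workers=4):
    """Mean silhouette, computed block by block in a thread pool."""
    X = np.asarray(X, dtype=float)
    _, codes = np.unique(labels, return_inverse=True)
    onehot = np.eye(codes.max() + 1)[codes]
    counts = onehot.sum(axis=0)
    n = len(X)
    starts = range(0, n, block_size)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(
            lambda s: _block_values(X, codes, onehot, counts,
                                    s, min(s + block_size, n)),
            starts))
    return float(np.concatenate(parts).mean())


rng = np.random.RandomState(0)
X_demo = rng.rand(2000, 5)
labels_demo = rng.randint(0, 50, size=2000)
print(silhouette_blocks(X_demo, labels_demo, block_size=500, n_workers=2))
```

If this is the idea, then peak memory would be about n_workers blocks at a time, so the number of workers and the block size would have to be chosen together.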
Regards,
T.Bao