Hey scikit people,

I know that handling big data is not the primary purpose of scikit, but
would you be interested in a PR of my block silhouette implementation? My
benchmarks show that it is a bit slower than the scikit one when the data
is small, but it divides memory usage by n_cluster ^ 2. It can also be
parallelized. The obvious downside is that the code is less readable.
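
To illustrate the block idea for readers of the thread, here is a minimal
sketch. It is not the code from the gist: the names block_silhouette and
_euclidean_block are made up for this example, Euclidean distance is assumed,
and no parallelism is shown. Distances are computed one pair of clusters at a
time, so only an (n_i x n_j) block is held in memory instead of the full
(n_samples x n_samples) matrix. With roughly equal-sized clusters, that is
where the n_cluster ^ 2 saving would come from.

```python
import numpy as np


def _euclidean_block(A, B):
    # Distances between rows of A and rows of B (hypothetical helper).
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :]
    sq -= 2.0 * (A @ B.T)
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding


def block_silhouette(X, labels):
    """Mean silhouette, computed one cluster-pair distance block at a time."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    masks = [labels == k for k in np.unique(labels)]
    if len(masks) < 2:
        raise ValueError("silhouette needs at least 2 clusters")
    n_samples = X.shape[0]
    a = np.zeros(n_samples)         # mean distance to own cluster
    b = np.full(n_samples, np.inf)  # mean distance to nearest other cluster
    singleton = np.zeros(n_samples, dtype=bool)
    for i, mask_i in enumerate(masks):
        X_i = X[mask_i]
        n_i = X_i.shape[0]
        if n_i == 1:
            singleton |= mask_i
        else:
            d = _euclidean_block(X_i, X_i)
            np.fill_diagonal(d, 0.0)
            a[mask_i] = d.sum(axis=1) / (n_i - 1)  # exclude the sample itself
        for mask_j in masks[i + 1:]:
            # One block serves both clusters: rows for i, columns for j.
            d = _euclidean_block(X_i, X[mask_j])
            b[mask_i] = np.minimum(b[mask_i], d.mean(axis=1))
            b[mask_j] = np.minimum(b[mask_j], d.mean(axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        s = (b - a) / np.maximum(a, b)
    s = np.nan_to_num(s)   # 0/0 (all-duplicate points) counts as 0
    s[singleton] = 0.0     # convention: singleton clusters score 0
    return s.mean()
```

Each cluster pair is independent of the others, so the inner loop is the
natural place to parallelize, at the cost of holding several blocks in memory
at once.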

I am currently working with data that does not fit in memory, so I try to
minimize memory usage as much as I can. I have also implemented an online
variance (and explained variance) object based on the [Chan79] approach
(there may be better ones, I haven't checked). It is not hard to code, but
it could be useful to some people.
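
For context, a minimal sketch of what such an online variance object could
look like is given below. It is only an illustration, not the implementation
mentioned above: the class name OnlineVariance and the partial_fit method are
assumptions chosen for the example. It applies the pairwise update from
[Chan79], which merges the count, mean, and sum of squared deviations of the
data seen so far with those of each new batch, so the full data set never has
to be in memory.

```python
import numpy as np


class OnlineVariance:
    """Per-feature running mean and variance over batches (hypothetical)."""

    def __init__(self):
        self.n = 0
        self.mean = None
        self.m2 = None  # running sum of squared deviations from the mean

    def partial_fit(self, X):
        X = np.asarray(X, dtype=float)
        n_b = X.shape[0]
        if n_b == 0:
            return self
        mean_b = X.mean(axis=0)
        m2_b = ((X - mean_b) ** 2).sum(axis=0)
        if self.n == 0:
            self.n, self.mean, self.m2 = n_b, mean_b, m2_b
            return self
        # Pairwise merge of (n, mean, m2) with (n_b, mean_b, m2_b).
        n = self.n + n_b
        delta = mean_b - self.mean
        self.m2 = self.m2 + m2_b + delta ** 2 * self.n * n_b / n
        self.mean = self.mean + delta * n_b / n
        self.n = n
        return self

    @property
    def variance(self):
        # Population variance (divides by n), matching the np.var default.
        return self.m2 / self.n
```

Feeding the data in chunks, for example
`for chunk in np.array_split(X, 10): ov.partial_fit(chunk)`, should then give
the same result as `X.var(axis=0)` up to floating-point rounding.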

Alexandre.

[Chan79] T. F. Chan, G. H. Golub, R. J. LeVeque, "Updating Formulae and a
Pairwise Algorithm for Computing Sample Variances", 1979.



On Fri, May 10, 2013 at 6:26 PM, Bao Thien <ntba...@gmail.com> wrote:

> Hi Alexandre,
>
> That sounds great. I will try it and let you know soon.
>
> Regards,
>
> T.Bao
>
>
> On Fri, May 10, 2013 at 6:19 PM, Alexandre ABRAHAM <
> abraham.alexan...@gmail.com> wrote:
>
>> Bao,
>>
>> Sorry for the delay. I have pushed a new version of the code to the gist
>> (there is now an n_jobs keyword parameter). It should use a bit more memory.
>>
>> Quick benchmark (see main in the gist):
>> Scikit silhouette (113.294149s): -0.013992
>> Block silhouette (23.485517s): -0.013992
>> Block silhouette parallel (23.351142s): -0.013992
>>
>> I only have 2 cores, so the parallel timing is not very meaningful. If you
>> have more, feedback is welcome!
>>
>> Alexandre.
>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and
>> their applications. This 200-page book is written by three acclaimed
>> leaders in the field. The early access version is available now.
>> Download your free book today! http://p.sf.net/sfu/neotech_d2d_may
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>
> --
> Nguyen Thien Bao
>
> NeuroInformatics Laboratory (NILab),
> Fondazione Bruno Kessler (FBK), Trento, Italy
>  Centro Interdipartimentale Mente e Cervello (CIMeC)
> Università degli Studi di Trento, Italy
> Email: ntba...@gmail.com  or  ntbao...@yahoo.com
> Cellphone: +39.345.293.1006 (Italy)
> Cellphone: +84.996.352.452 (VietNam)
>
>
>
>
