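Evaluating on a large dataset is easy when the metric's sufficient statistics are just the contingency matrix, i.e. the count of samples for each (true label, cluster) pair. That matrix can be accumulated batch by batch, so the full dataset never has to be in memory. Cluster purity is one such metric.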
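
As a rough sketch of that idea (not an existing scikit-learn or dask-ml API), the contingency matrix can be built up with NumPy one batch at a time, and purity computed from the final counts. The helper names, the toy `batches()` generator and the assumption that labels and cluster ids are integers in a known range are all mine, for illustration only:

```python
import numpy as np


def update_contingency(counts, true_labels, cluster_ids):
    """Add one batch of (true label, cluster id) pairs to the running counts."""
    # np.add.at accumulates correctly even when the same pair repeats in a batch.
    np.add.at(counts, (true_labels, cluster_ids), 1)
    return counts


def purity(counts):
    """Fraction of samples that carry the majority label of their cluster."""
    return counts.max(axis=0).sum() / counts.sum()


def batches():
    # Hypothetical stand-in for reading chunks from disk:
    # each item is (true labels, predicted cluster ids) for one chunk.
    yield np.array([0, 0, 1]), np.array([0, 0, 1])
    yield np.array([1, 1, 0]), np.array([1, 1, 1])


# Assumes the number of classes and clusters is known up front.
n_classes, n_clusters = 2, 2
counts = np.zeros((n_classes, n_clusters), dtype=np.int64)
for y_true, y_pred in batches():
    update_contingency(counts, y_true, y_pred)

score = purity(counts)
print(counts)
print(score)
```

Only the small n_classes x n_clusters matrix stays in memory, so the same loop should work regardless of how many batches there are.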

On Tue., 14 May 2019, 11:19 pm Tom Augspurger, <tom.augspurge...@gmail.com> wrote:

> If anyone is interested in implementing these, dask-ml would welcome
> additional metrics that work well with Dask arrays:
> https://github.com/dask/dask-ml/issues/213.
>
> On Tue, May 14, 2019 at 2:09 AM Uri Goren <ugo...@gmail.com> wrote:
>
>> Sounds like you need to use Spark,
>> this project looks promising:
>> https://github.com/xiaocai00/SparkPinkMST
>>
>> On Tue, May 14, 2019 at 5:12 AM lampahome <pahome.c...@mirlab.org> wrote:
>>
>>> Uri Goren <ugo...@gmail.com> wrote on Fri, May 3, 2019 at 7:29 PM:
>>>
>>>> I usually use clustering to save costs on labelling.
>>>> I like to apply hierarchical clustering, and then label a small sample
>>>> and fine-tune the clustering algorithm.
>>>>
>>>> That way, you can evaluate the effectiveness in terms of cluster purity
>>>> (how many clusters contain mixed labels)
>>>>
>>>> See example with sklearn here:
>>>> https://youtu.be/GM8L324MuHc?list=PLqkckaeDLF4IDdKltyBwx8jLaz5nwDPQU
>>>>
>>>> But if my dataset is too large to load into memory, will it work?

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn