Hope this helps!
Manuel
@Article{Ciampi2008,
  author="Ciampi, Antonio and Lechevallier, Yves and Limas, Manuel Castej{\'o}n and Marcos, Ana Gonz{\'a}lez",
  title="Hierarchical clustering of subpopulations with a dissimilarity based on the likelihood ratio statistic: application to clustering massive data sets",
  journal="Pattern Analysis and Applications",
  year="2008",
  month="Jun",
  day="01",
  volume="11",
  number="2",
  pages="199--220",
  abstract="The problem of clustering subpopulations on the basis of samples is considered within a statistical framework: a distribution for the variables is assumed for each subpopulation and the dissimilarity between any two populations is defined as the likelihood ratio statistic which compares the hypothesis that the two subpopulations differ in the parameter of their distributions to the hypothesis that they do not. A general algorithm for the construction of a hierarchical classification is described which has the important property of not having inversions in the dendrogram. The essential elements of the algorithm are specified for the case of well-known distributions (normal, multinomial and Poisson) and an outline of the general parametric case is also discussed. Several applications are discussed, the main one being a novel approach to dealing with massive data in the context of a two-step approach. After clustering the data in a reasonable number of `bins' by a fast algorithm such as k-Means, we apply a version of our algorithm to the resulting bins. Multivariate normality for the means calculated on each bin is assumed: this is justified by the central limit theorem and the assumption that each bin contains a large number of units, an assumption generally justified when dealing with truly massive data such as currently found in modern data analysis. However, no assumption is made about the data generating distribution.",
  issn="1433-755X",
  doi="10.1007/s10044-007-0088-4",
  url="https://doi.org/10.1007/s10044-007-0088-4"
}

2018-01-04 12:55 GMT+01:00 Joel Nothman <joel.noth...@gmail.com>:

> Can you use nearest neighbors with a KD tree to build a distance matrix
> that is sparse, in that distances to all but the nearest neighbors of a
> point are (near-)infinite? Yes, this again has an additional parameter
> (neighborhood size), just as BIRCH has its threshold. I suspect you will
> not be able to improve on having another, approximating, parameter. You do
> not need to set n_clusters to a fixed value for BIRCH. You only need to
> provide another clusterer, which has its own parameters, although you
> should be able to experiment with different "global clusterers".
>
> On 4 January 2018 at 11:04, Shiheng Duan <shid...@ucdavis.edu> wrote:
>
>> Yes, it is an efficient method, still, we need to specify the number of
>> clusters or the threshold. Is there another way to run hierarchy clustering
>> on the big dataset? The main problem is the distance matrix.
>> Thanks.
>>
>> On Tue, Jan 2, 2018 at 6:02 AM, Olivier Grisel <olivier.gri...@ensta.org>
>> wrote:
>>
>>> Have you had a look at BIRCH?
>>>
>>> http://scikit-learn.org/stable/modules/clustering.html#birch
>>>
>>> --
>>> Olivier
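
For anyone reading this thread later, here is a minimal sketch of the two suggestions quoted above. It is not taken from any of the messages. It assumes scikit-learn >= 0.21 (for `distance_threshold`, which did not exist when this thread was written), and the toy data and every parameter value (`threshold=0.5`, `distance_threshold=5.0`, `n_neighbors=10`, `n_clusters=3`) are placeholders to tune. Note that the second option uses the `connectivity` argument of `AgglomerativeClustering` with a k-nearest-neighbours graph; this is a related technique, not literally a sparse distance matrix with near-infinite entries.

```python
from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

# Toy data standing in for the real (large) dataset.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Option 1: BIRCH with a hierarchical "global clusterer".
# BIRCH first compresses the data into subclusters; the global clusterer
# then runs only on the subcluster centroids, so no full n x n distance
# matrix over the raw samples is needed. Using distance_threshold instead
# of a fixed n_clusters lets the merge distance decide the cluster count.
global_clusterer = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="ward"
)
birch = Birch(threshold=0.5, n_clusters=global_clusterer)
birch_labels = birch.fit_predict(X)

# Option 2: agglomerative clustering restricted to a sparse k-NN graph.
# Only pairs of points linked in the graph are considered for merging.
# If the graph is disconnected, scikit-learn warns and completes it.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
ward = AgglomerativeClustering(
    n_clusters=3, linkage="ward", connectivity=connectivity
)
knn_labels = ward.fit_predict(X)

print("BIRCH subclusters:", len(birch.subcluster_centers_))
print("BIRCH + agglomerative clusters:", len(set(birch_labels.tolist())))
print("k-NN constrained Ward clusters:", len(set(knn_labels.tolist())))
```

Both options still carry one approximating parameter each (the BIRCH threshold or the neighbourhood size), which matches the point made in the quoted reply.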
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn