Hello,

sklearn.cluster.Birch follows the original BIRCH paper, which appears to focus mostly on efficiently building the hierarchical clustering tree (and not so much on making the later analysis user-friendly). The attributes exposed by Birch are those that could reasonably be exposed given the scikit-learn API constraints. However, one does have access to the full cluster hierarchy via Birch.root_.

As Joel said, traversing the tree is a standard CS problem, and there are probably a number of operations that could be done with it, depending on the application. For instance, for my use case, I found that reconstructing the Birch hierarchy using a custom container class for each subcluster was the easiest way to run subsequent analysis. A detailed example can be found here:
http://freediscovery.io/doc/stable/python/examples/birch_cluster_hierarchy.html
Alternatively, I wonder if converting the tree to a format readable by a specialized tree/graph library (e.g. networkx) could be useful for analysis.
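
For illustration, here is a minimal sketch of the kind of traversal discussed above. It assumes that each node reachable from Birch.root_ has a `subclusters_` list and that each subcluster has a `child_` attribute that is either None or another node. These names belong to scikit-learn's private CF-tree classes, so they may differ between versions. The helper name `iter_subclusters` is hypothetical and not part of scikit-learn. It returns the "list of tuples" style of path that Joel suggested rather than a decimal encoding.

```python
def iter_subclusters(node, path=()):
    """Yield (path, subcluster) pairs for every subcluster below `node`.

    `path` is a tuple of child indices counted from the starting node,
    e.g. (2, 0, 5). Assumption: `node.subclusters_` is a list and each
    subcluster has a `child_` attribute (None when there is no child node).
    """
    for idx, subcluster in enumerate(node.subclusters_):
        current = path + (idx,)
        yield current, subcluster
        child = getattr(subcluster, "child_", None)
        if child is not None:
            # Descend into the node hanging below this subcluster.
            yield from iter_subclusters(child, current)
```

With a fitted model this would be called as `list(iter_subclusters(model.root_))`, where `model = Birch(...).fit(X)`. Each yielded subcluster then carries its summary statistics, not the original samples.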

Generally, there are a number of places in scikit-learn where trees are used (Birch, AgglomerativeClustering, tree-based classifiers, etc.), but for now there is no way to export the constructed tree to a standard format (apart from sklearn.tree.export_graphviz). I am not sure whether this is realistically achievable, though.
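
To make the idea of an export concrete for Birch only, the following is a rough sketch that dumps the hierarchy to nested JSON using the standard library. It is not an existing scikit-learn feature. It assumes the private attribute names `subclusters_`, `child_`, `n_samples_` and `centroid_` on the CF-tree objects, and the function names `subcluster_to_dict` and `birch_tree_to_json` are hypothetical.

```python
import json


def subcluster_to_dict(subcluster):
    """Convert one subcluster and everything below it to plain Python types."""
    child = getattr(subcluster, "child_", None)
    return {
        "n_samples": int(subcluster.n_samples_),
        # Works for NumPy arrays as well as plain sequences of numbers.
        "centroid": [float(x) for x in subcluster.centroid_],
        "children": (
            []
            if child is None
            else [subcluster_to_dict(sc) for sc in child.subclusters_]
        ),
    }


def birch_tree_to_json(root, **json_kwargs):
    """Serialize the hierarchy below `root` (e.g. Birch.root_) to a JSON string."""
    tree = {"children": [subcluster_to_dict(sc) for sc in root.subclusters_]}
    return json.dumps(tree, **json_kwargs)
```

The resulting string can be loaded by any JSON-aware tool. Only the summary statistics are exported, since the tree does not keep references to individual samples.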

--
Roman

On 20/09/17 13:40, Sema Atasever wrote:
I need this information to use it in a scientific study and
I think that a function interface would make this easier.

Thank you for your answer.

On Sat, Sep 16, 2017 at 1:53 PM, Joel Nothman <joel.noth...@gmail.com> wrote:

    There is no such thing as "the data samples in this cluster". The
    point of Birch being online is that it loses any reference to the
    individual samples that contributed to each node, but stores some
    statistics on their basis. Roman Yurchak has, however, offered a PR
    where, for the non-online case, storage of the indices contributing
    to each node can be optionally turned on:
    https://github.com/scikit-learn/scikit-learn/pull/8808

    As for finding what is contained under any particular node,
    traversing the tree is a fairly basic task from a computer science
    perspective. Before we were to support something to make this much
    easier, I think we'd need to be clear on what kinds of use case we
    were supporting. What do you hope to do with this information, and
    what would a function interface look like that would make this much
    easier?

    Decimals aren't a practical option as the branching factor may be
    greater than 10, it is a hard structure to inspect, and susceptible
    to computational imprecision. Better off with a list of tuples, but
    what for that is not easy enough to do now?



    _______________________________________________
    scikit-learn mailing list
    scikit-learn@python.org <mailto:scikit-learn@python.org>
    https://mail.python.org/mailman/listinfo/scikit-learn
    <https://mail.python.org/mailman/listinfo/scikit-learn>



