On 10/24/18 4:11 AM, Manuel Castejón Limas wrote:
Dear all,
as a way of improving the documentation of PipeGraph we intend to provide more examples of its usage. It was a popular demand to show application cases to motivate its usage, so here is a very simple case with two steps: a KMeans followed by an LDA.

https://mcasl.github.io/PipeGraph/auto_examples/plot_Finding_Number_of_clusters.html#sphx-glr-auto-examples-plot-finding-number-of-clusters-py

This short example points out the following challenges:
- KMeans is not a transformer but an estimator

KMeans is a transformer in sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform

(you can't get the labels as the output, which is what you're doing here, but it is a transformer)
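To make the distinction concrete, here's a quick sketch (toy data via make_blobs, just for illustration): transform returns distances to the cluster centers, while the labels come from predict:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
km = KMeans(n_clusters=3).fit(X)

km.transform(X).shape  # (100, 3): distance of each sample to each center
km.predict(X).shape    # (100,): the cluster labels you were after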

- LDA's score function requires the y parameter, while its input does not come from a known set of labels but from the previous KMeans
- Moreover, the GridSearchCV.fit call would also require a 'y' parameter

That's not true if you provide a scoring function that doesn't require y, or if you don't specify scoring and the estimator's score method doesn't require y.

GridSearchCV.fit doesn't require y.
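For instance, something along these lines works (a quick sketch; the silhouette-based scorer is just one choice):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

def silhouette_scorer(estimator, X, y=None):
    # y is ignored: we score the clustering structure itself
    return silhouette_score(X, estimator.predict(X))

search = GridSearchCV(KMeans(), {'n_clusters': range(2, 6)},
                      scoring=silhouette_scorer, cv=3)
search.fit(X)  # no y needed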

- It would be nice to have access to the output of the KMeans step as well.

PipeGraph is capable of addressing these challenges.

The rationale for this example lies in the identification-reconstruction realm. In a scenario where the class labels are unknown, we might want to tie the quality of the clustering structure to the capability of a later model to reconstruct that structure. So the basic idea here is that if LDA is capable of getting good results, that is because the information from the KMeans was good enough for that purpose, hinting at the discovery of a good structure.

Can you provide a citation for that? That seems to depend heavily on the clustering algorithm and the classifier.
To me, stability scoring seems more natural: https://arxiv.org/abs/1007.1075

This does seem interesting as well, though; I haven't thought about it before.
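Back to stability scoring: for concreteness, a score in that spirit might look roughly like this (a rough sketch, not the exact protocol from the paper; the helper name is mine):

import numpy as np
from sklearn.base import clone
from sklearn.metrics import adjusted_rand_score

def stability_score(estimator, X, n_draws=20, random_state=0):
    # Fit on two random half-samples and compare the labelings they
    # induce on the full data; a stable structure gives high agreement.
    rng = np.random.RandomState(random_state)
    scores = []
    for _ in range(n_draws):
        a = rng.choice(len(X), len(X) // 2, replace=False)
        b = rng.choice(len(X), len(X) // 2, replace=False)
        labels_a = clone(estimator).fit(X[a]).predict(X)
        labels_b = clone(estimator).fit(X[b]).predict(X)
        scores.append(adjusted_rand_score(labels_a, labels_b))
    return np.mean(scores)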

It's cool that this is possible, but I feel this is still not really a "killer application" in that this is not a very common pattern.

Also, you could do something similar in sklearn with:

import numpy as np
from sklearn.model_selection import cross_val_score

def estimator_scorer(testing_estimator):
    def my_scorer(estimator, X, y=None):
        # Use the fitted clusterer's predictions as pseudo-labels
        y = estimator.predict(X)
        # Score how well testing_estimator can reconstruct them
        return np.mean(cross_val_score(testing_estimator, X, y))
    return my_scorer

Though using that, we'd be doing nested cross-validation on the test set...
That's a bit of an issue in the current GridSearchCV implementation :-/ There's an issue by Joel somewhere to implement something that allows training without splitting, which is what you'd want here. You could run the outer grid search with a custom cross-validation iterator that returns all indices as training and test set and only does a single split, though:

from sklearn.utils.validation import _num_samples

class NoSplitCV(object):
    def split(self, X, y=None, groups=None):
        # Single "split" that uses all samples as both training and test set
        indices = np.arange(_num_samples(X))
        yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1

Though I acknowledge that your code takes only a handful of lines, while mine takes a few more (though if we added NoSplitCV to sklearn, mine would also be just a handful :P)
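Putting the pieces together, it would be wired up something like this (again a sketch, reusing X from above, with LinearDiscriminantAnalysis standing in for your LDA step):

from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(KMeans(),
                      {'n_clusters': range(2, 11)},
                      scoring=estimator_scorer(LinearDiscriminantAnalysis()),
                      cv=NoSplitCV())
search.fit(X)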

I think pipegraph is cool, not meaning to give you a hard time ;)
