I'm currently thinking about a computational graph which could then be wrapped as a pipeline-like object ... I'll try to make a toy example solving my problem.
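Something along these lines, as a minimal first sketch: each step is a plain callable that reads the partial results it needs from a shared dictionary (not necessarily the output of the immediately previous step) and contributes its own results back. GraphPipeline, the step signature, and the key names are placeholders, not a proposed API; data_train comes from the example quoted below.

from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

class GraphPipeline:
    def __init__(self, steps):
        # steps: list of (name, callable) pairs; each callable receives the
        # shared results dict and returns a dict of new partial results
        self.steps = steps

    def run(self, **initial_results):
        results = dict(initial_results)
        # a reference to the pipeline itself, so a step can inspect the
        # internal state of previous steps if it ever needs to
        results['pipeline'] = self
        for name, step in self.steps:
            results.update(step(results))
        return results

def dbscan_step(results):
    labels = DBSCAN(eps=0.06).fit_predict(results['data_train'])
    return {'dbscan_output': labels}

def mclust_step(results):
    # uses the dbscan results, not the raw output of the previous step
    gm = GaussianMixture(n_components=3)
    gm.fit(results['data_train'][results['dbscan_output'] != -1])
    return {'mclust_output': gm.predict(results['data_train'])}

pipe = GraphPipeline([('dbscan', dbscan_step), ('mclust', mclust_step)])
out = pipe.run(data_train=data_train)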
On 20 Dec 2017 16:33, "Manuel Castejón Limas" <manuel.caste...@gmail.com> wrote:

> Thank you all for your interest!
>
> In order to clarify the case, allow me to try to synthesize the spirit of
> what I'd like to put into the pipeline using this sequence of steps:
>
> #%%
> import pandas as pd
> import numpy as np
> import matplotlib.pyplot as plt
>
> from sklearn.cluster import DBSCAN
> from sklearn.mixture import GaussianMixture
> from sklearn.model_selection import train_test_split
>
> np.random.seed(seed=42)
>
> """
> Data preparation
> """
>
> URL = ("https://raw.githubusercontent.com/mcasl/PAELLA/master/data/"
>        "sin_60_percent_noise.csv")
> data = pd.read_csv(URL, usecols=['V1', 'V2'])
> X, y = data[['V1']], data[['V2']]
>
> (data_train, data_test,
>  X_train, X_test,
>  y_train, y_test) = train_test_split(data, X, y)
>
> """
> Parameters setup
> """
>
> dbscan__eps = 0.06
>
> mclust__n_components = 3
>
> paella__noise_label = -1
> paella__max_it = 20
> paella__regular_size = 400
> paella__minimum_size = 100
> paella__width_r = 0.99
> paella__n_neighbors = 5
> paella__power = 30
> paella__random_state = None
>
> #%%
> """
> DBSCAN clustering to detect noise suspects (label == -1)
> """
>
> dbscan_input = data_train
>
> dbscan_clustering = DBSCAN(eps=dbscan__eps)
>
> dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
>
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>             c=np.int64(dbscan_output == -1))
>
> #%%
> """
> GaussianMixture is fitted with the filtered data_train in order to help
> locate the ellipsoids, but predict is applied to the whole data_train set.
> """
>
> mclust_input = data_train[dbscan_output != -1]
>
> mclust_clustering = GaussianMixture(n_components=mclust__n_components)
> mclust_clustering.fit(mclust_input)
>
> mclust_output = mclust_clustering.predict(data_train)
>
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>             c=mclust_output)
>
> #%%
> """
> The mclust and dbscan results are combined.
> """
>
> clustering_output = mclust_output.copy()
> clustering_output[dbscan_output == -1] = -1
>
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>             c=clustering_output)
>
> #%%
> """
> Good old Paella paper:
> https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c
>
> The Paella algorithm calculates the sample_weight to be used by the
> final-step regressor (yes, it is an outlier detection algorithm, but we
> are focusing now on this interesting collateral result). I am currently
> aggressively changing the code in order to make it fit somehow with the
> pipeline.
> """
>
> from paella import Paella
>
> # the cluster labels are wrapped as a Series so that they can be
> # concatenated with the training data
> paella_input = pd.concat([data_train,
>                           pd.Series(clustering_output,
>                                     index=data_train.index,
>                                     name='cluster')],
>                          axis=1)
>
> paella_run = Paella(noise_label=paella__noise_label,
>                     max_it=paella__max_it,
>                     regular_size=paella__regular_size,
>                     minimum_size=paella__minimum_size,
>                     width_r=paella__width_r,
>                     n_neighbors=paella__n_neighbors,
>                     power=paella__power,
>                     random_state=paella__random_state)
>
> paella_output = paella_run.fit_predict(paella_input, y_train)
> # paella_output is a vector with the sample_weight values
>
> #%%
> """
> Here we fit a regressor using sample_weight=paella_output
> """
> from sklearn.linear_model import LinearRegression
>
> regressor_input = X_train
> lm = LinearRegression()
> lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
> regressor_output = lm.predict(X_train)
>
> # ...
> In this example we can see that:
>
> - A particular step might need results that were not necessarily produced
> by the immediately previous step.
> - The X parameter is not sequentially transformed. Sometimes we might need
> to skip back to a previous step.
> - y is sometimes the target and sometimes it is not. For the regressor it
> is indeed the target, but for the Paella algorithm the prediction is a
> vector representing sample_weights.
>
> All in all, the conclusion is that the chain of processes is not as linear
> as imposed by the current API. I guess that all these difficulties could
> be solved by:
>
> - Passing a dictionary through the different steps containing the partial
> results that the following steps will need.
> - As a Christmas gift :-), a reference to the pipeline itself inserted
> into that dictionary could provide access to the internal state of the
> previous steps, should it be needed.
>
> Another interesting study case with similar needs would be a regressor
> using a previous clustering step in order to fit one model per cluster.
> In such a case, the clustering results would be needed during the
> fitting; a rough sketch of that idea follows below.
>
> Thanks for your interest!
> Manolo
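For the one-model-per-cluster case mentioned above, here is a minimal sketch, assuming y is one-dimensional and using KMeans rather than DBSCAN because the clusterer needs a predict method for new data; ClusteredRegressor and its attribute names are just placeholders, not a proposed API:

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

class ClusteredRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, clusterer, regressor):
        self.clusterer = clusterer
        self.regressor = regressor

    def fit(self, X, y):
        # the clustering results are needed during the fitting:
        # one regressor is fitted on each cluster's samples
        self.clusterer_ = clone(self.clusterer)
        labels = self.clusterer_.fit_predict(X)
        self.models_ = {}
        for label in np.unique(labels):
            mask = labels == label
            self.models_[label] = clone(self.regressor).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        # route each sample to the regressor of its predicted cluster
        labels = self.clusterer_.predict(X)
        y_pred = np.empty(len(X))
        for label, model in self.models_.items():
            mask = labels == label
            if mask.any():
                y_pred[mask] = np.ravel(model.predict(X[mask]))
        return y_pred

# e.g., with the variables from the example above:
# clustered_lm = ClusteredRegressor(KMeans(n_clusters=3), LinearRegression())
# clustered_lm.fit(X_train, np.ravel(y_train))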