Thank you all for your interest! To clarify the case, allow me to synthesize the spirit of what I'd like to put into the pipeline with this sequence of steps:

#%%
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

np.random.seed(seed=42)

""" Data preparation """
URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv"
data = pd.read_csv(URL, usecols=['V1','V2'])
X, y = data[['V1']], data[['V2']]
(data_train, data_test,
 X_train, X_test,
 y_train, y_test) = train_test_split(data, X, y)

""" Parameters setup """
dbscan__eps = 0.06
mclust__n_components = 3
paella__noise_label = -1
paella__max_it = 20
paella__regular_size = 400
paella__minimum_size = 100
paella__width_r = 0.99
paella__n_neighbors = 5
paella__power = 30
paella__random_state = None

#%%
""" DBSCAN clustering to detect noise suspects (label == -1) """
dbscan_input = data_train
dbscan_clustering = DBSCAN(eps = dbscan__eps)
dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, c=np.int64(dbscan_output == -1))

#%%
"""
GaussianMixture is fitted on data_train with the noise suspects filtered out,
in order to help locate the ellipsoids, but predict is applied to the whole
data_train set.
"""
# keep only the points that DBSCAN did not flag as noise (label -1)
mclust_input = data_train[dbscan_output != -1]
mclust_clustering = GaussianMixture(n_components = mclust__n_components)
mclust_clustering.fit(mclust_input)
mclust_output = mclust_clustering.predict(data_train)
plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, c=mclust_output)

#%%
""" mclust and dbscan results are combined. """
clustering_output = mclust_output.copy()
clustering_output[dbscan_output == -1] = -1
plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1, c=clustering_output)

#%%
"""
Good old Paella paper: https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c
The Paella algorithm calculates sample_weight to be used by the final step
regressor (yes, it is an outlier detection algorithm, but we are focusing now
on this interesting collateral result). I am currently aggressively changing
the code in order to make it fit somehow with the pipeline.
"""
from paella import Paella
# clustering_output is a NumPy array with one label per row of data_train, so
# it is wrapped in a Series aligned on the same index before concatenating
# (the column name 'cluster' is arbitrary; pd.concat has no inplace argument)
paella_input = pd.concat([data_train,
                          pd.Series(clustering_output, index=data_train.index, name='cluster')],
                         axis=1)
paella_run = Paella(noise_label = paella__noise_label,
                    max_it = paella__max_it,
                    regular_size = paella__regular_size,
                    minimum_size = paella__minimum_size,
                    width_r = paella__width_r,
                    n_neighbors = paella__n_neighbors,
                    power = paella__power,
                    random_state = paella__random_state)
paella_output = paella_run.fit_predict(paella_input, y_train)  # paella_output is a vector with sample_weight

#%%
""" Here we fit a regressor using sample_weight=paella_output """
from sklearn.linear_model import LinearRegression
regressor_input = X_train
lm = LinearRegression()
lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
regressor_output = lm.predict(X_train)
#...

In this example we can see that:

- A particular step might need results that were not produced by the immediately previous step.
- The X parameter is not sequentially transformed. Sometimes we might need to go back to the output of an earlier step.
- y is sometimes the target and sometimes not. For the regressor it is indeed the target, but for the Paella algorithm the prediction is expressed as a vector representing sample_weights.

All in all, the conclusion is that the chain of processes is not as linear as the current API imposes. I guess that all these difficulties could be solved by:

- Passing a dictionary through the different steps, containing the partial results that the following steps will need.
- As a Christmas gift :-) , inserting in that dictionary a reference to the pipeline itself, which would provide access to the internal state of the previous steps should it be needed.

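To make the dictionary idea a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: ContextPipeline and the three toy step functions are hypothetical names, not part of scikit-learn, and the steps are simple stand-ins for the clustering, Paella and regressor stages above.

```python
# Hypothetical sketch: none of these names belong to scikit-learn.
class ContextPipeline:
    """Run named steps in order; each step reads from and adds to a shared dict."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, callable) pairs

    def run(self, **inputs):
        context = dict(inputs)
        # A reference to the pipeline itself, so steps can inspect earlier steps
        context["pipeline"] = self
        for name, step in self.steps:
            # The result of each step is stored under the step's own name
            context[name] = step(context)
        return context


def flag_noise(context):
    """Stand-in for the clustering steps: label points far from the median as -1."""
    y = context["y"]
    median = sorted(y)[len(y) // 2]
    return [-1 if abs(v - median) > context["threshold"] else 0 for v in y]


def compute_weights(context):
    """Stand-in for Paella: turn the labels of an earlier step into sample weights."""
    return [0.0 if label == -1 else 1.0 for label in context["flag_noise"]]


def weighted_mean(context):
    """Stand-in for the regressor: needs the original y AND the weights."""
    y, w = context["y"], context["compute_weights"]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


pipe = ContextPipeline([
    ("flag_noise", flag_noise),
    ("compute_weights", compute_weights),
    ("weighted_mean", weighted_mean),
])
result = pipe.run(y=[1.0, 1.2, 0.8, 1.1, 9.0], threshold=2.0)
print(result["compute_weights"])
print(result["weighted_mean"])
```

The last step reads both the original y and the weights computed two steps earlier, which is exactly what a strictly linear X -> X' chain cannot express.
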
Another interesting case study with similar needs would be a regressor that uses a previous clustering step in order to fit one model per cluster. In that case, the clustering results would be needed during fitting.

Thanks for your interest!
Manolo
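
P.S. A minimal sketch of the one-model-per-cluster idea, only to show why the cluster labels are needed at fit time. Everything here is hypothetical: PerClusterRegressor and fit_line are illustrative names, the per-cluster model is a plain least-squares line written in pure Python rather than a scikit-learn estimator, and the labels are assumed to come from some earlier clustering step.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x on the points of one cluster.

    Assumes the cluster has at least two distinct x values.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope  # (intercept, slope)


class PerClusterRegressor:
    """Hypothetical helper: fits one straight line per cluster label."""

    def fit(self, xs, ys, labels):
        # Group the training points by the label given by the clustering step
        groups = defaultdict(lambda: ([], []))
        for x, y, label in zip(xs, ys, labels):
            groups[label][0].append(x)
            groups[label][1].append(y)
        self.models_ = {label: fit_line(gx, gy)
                        for label, (gx, gy) in groups.items()}
        return self

    def predict(self, xs, labels):
        # Each point is predicted with the line of its own cluster
        out = []
        for x, label in zip(xs, labels):
            intercept, slope = self.models_[label]
            out.append(intercept + slope * x)
        return out


xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
ys = [0.0, 2.0, 4.0, 5.0, 4.0, 3.0]
labels = [0, 0, 0, 1, 1, 1]  # assumed output of a previous clustering step
model = PerClusterRegressor().fit(xs, ys, labels)
print(model.models_)
print(model.predict([1.5, 10.5], [0, 1]))
```

Note that fit takes three arguments (X, y and the labels), which is one more thing that does not fit the usual fit(X, y) signature.
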