I'm elaborating on the graph idea: a dictionary describes the graph, the networkx package supports the graph and runs its nodes in topological order, and some wrappers adapt the scikit-learn models.
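To give an idea of what those wrappers might look like, here is a minimal sketch of an operation class. The `Step` base class, the `graph`/`node_name` constructor arguments, and the `'output'` node attribute are assumptions of mine that mirror the example below, not an existing API:

#%%
import networkx as nx
from sklearn.cluster import DBSCAN


class Step:
    # Hypothetical base class: each step knows the graph it lives in
    # and its own node name, so it can read its ancestors' results.
    def __init__(self, graph, node_name):
        self.graph = graph
        self.node_name = node_name

    def input_value(self, ancestor_name, attribute):
        # Fetch a result previously stored by an ancestor node.
        return self.graph.nodes[ancestor_name]['output'][attribute]


class Dbscan_Step(Step):
    # Hypothetical wrapper around a scikit-learn clusterer.
    def fit(self):
        data = self.input_value('Concatenate_Xy', 'data')
        labels = DBSCAN(eps=0.06).fit_predict(data)
        # Store the result where descendant steps can find it.
        self.graph.nodes[self.node_name]['output'] = {'classification': labels}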
I'm currently thinking of putting some more effort into a contrib project. It could be something inspired by this example.

Manolo

#-------------------------------------------------
import networkx as nx

graph_description = {
    'First':             {'operation': First_Step,
                          'input': {'X': X, 'y': y}},
    'Concatenate_Xy':    {'operation': ConcatenateData_Step,
                          'input': [('First', 'X'), ('First', 'y')]},
    'Gaussian_Mixture':  {'operation': Gaussian_Mixture_Step,
                          'input': [('Concatenate_Xy', 'data')]},
    'Dbscan':            {'operation': Dbscan_Step,
                          'input': [('Concatenate_Xy', 'data')]},
    'CombineClustering': {'operation': CombineClustering_Step,
                          'input': [('Dbscan', 'classification'),
                                    ('Gaussian_Mixture', 'classification')]},
    'Paella':            {'operation': Paella_Step,
                          'input': [('First', 'X'), ('First', 'y'),
                                    ('Concatenate_Xy', 'data'),
                                    ('CombineClustering', 'classification')]},
    'Regressor':         {'operation': Regressor_Step,
                          'input': [('First', 'X'), ('First', 'y'),
                                    ('Paella', 'sample_weight')]},
    'Last':              {'operation': Last_Step,
                          'input': [('Regressor', 'regressor')]},
}

#%%
def create_graph(description):
    cg = nx.DiGraph()
    cg.add_nodes_from(description)
    for current_name, info in description.items():
        current_node = cg.nodes[current_name]  # '.node' is '.nodes' in networkx >= 2
        current_node['operation'] = info['operation'](graph=cg,
                                                      node_name=current_name)
        current_node['input'] = info['input']
        if current_name != 'First':
            for ascendant in set(name for name, attribute in info['input']):
                cg.add_edge(ascendant, current_name)
    return cg

#%%
cg = create_graph(graph_description)

node_pos = {'First':             ( 0, 0),
            'Concatenate_Xy':    ( 2, 4),
            'Gaussian_Mixture':  ( 6, 8),
            'Dbscan':            ( 6, 6),
            'CombineClustering': ( 8, 7),
            'Paella':            (10, 2),
            'Regressor':         (12, 0),
            'Last':              (16, 0)}

nx.draw(cg, pos=node_pos, with_labels=True)

#%%
print("=========================")
for name in nx.topological_sort(cg):
    print("Running: ", name)
    cg.nodes[name]['operation'].fit()
print("=========================")

########################

2017-12-22 12:09 GMT+01:00 Manuel Castejón Limas <manuel.caste...@gmail.com>:

> I'm currently thinking of a computational graph which can then be wrapped
> as a pipeline-like object ... I'll try to make a toy example solving my
> problem.
>
> On 20 Dec 2017 16:33, "Manuel Castejón Limas" <manuel.caste...@gmail.com>
> wrote:
>
>> Thank you all for your interest!
>>
>> In order to clarify the case, allow me to try to synthesize the spirit of
>> what I'd like to put into the pipeline using this sequence of steps:
>>
>> #%%
>> import pandas as pd
>> import numpy as np
>> import matplotlib.pyplot as plt
>>
>> from sklearn.cluster import DBSCAN
>> from sklearn.mixture import GaussianMixture
>> from sklearn.model_selection import train_test_split
>>
>> np.random.seed(seed=42)
>>
>> """
>> Data preparation
>> """
>>
>> URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv"
>> data = pd.read_csv(URL, usecols=['V1', 'V2'])
>> X, y = data[['V1']], data[['V2']]
>>
>> (data_train, data_test,
>>  X_train, X_test,
>>  y_train, y_test) = train_test_split(data, X, y)
>>
>> """
>> Parameters setup
>> """
>>
>> dbscan__eps = 0.06
>>
>> mclust__n_components = 3
>>
>> paella__noise_label = -1
>> paella__max_it = 20
>> paella__regular_size = 400
>> paella__minimum_size = 100
>> paella__width_r = 0.99
>> paella__n_neighbors = 5
>> paella__power = 30
>> paella__random_state = None
>>
>> #%%
>> """
>> DBSCAN clustering to detect noise suspects (label == -1)
>> """
>>
>> dbscan_input = data_train
>>
>> dbscan_clustering = DBSCAN(eps=dbscan__eps)
>>
>> dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
>>
>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>>             c=(dbscan_output == -1).astype(int))
>>
>> #%%
>> """
>> GaussianMixture is fitted with the filtered data_train in order to help
>> locate the ellipsoids, but predict is applied to the whole data_train set.
>> """
>>
>> mclust_input = data_train[dbscan_output != -1]  # drop the noise suspects
>>
>> mclust_clustering = GaussianMixture(n_components=mclust__n_components)
>> mclust_clustering.fit(mclust_input)
>>
>> mclust_output = mclust_clustering.predict(data_train)
>>
>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>>             c=mclust_output)
>>
>> #%%
>> """
>> mclust and dbscan results are combined.
>> """
>>
>> clustering_output = mclust_output.copy()
>> clustering_output[dbscan_output == -1] = -1
>>
>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>>             c=clustering_output)
>>
>> #%%
>> """
>> The good old Paella paper:
>> https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c
>>
>> The Paella algorithm calculates the sample_weight to be used by the final
>> step's regressor (yes, it is an outlier detection algorithm, but we are
>> focusing now on this interesting collateral result). I am currently
>> aggressively changing the code in order to make it fit somehow with the
>> pipeline.
>> """
>>
>> from paella import Paella
>>
>> # clustering_output is a plain array, so wrap it in a Series to concat it
>> paella_input = pd.concat([data_train,
>>                           pd.Series(clustering_output,
>>                                     index=data_train.index,
>>                                     name='classification')],
>>                          axis=1)
>>
>> paella_run = Paella(noise_label=paella__noise_label,
>>                     max_it=paella__max_it,
>>                     regular_size=paella__regular_size,
>>                     minimum_size=paella__minimum_size,
>>                     width_r=paella__width_r,
>>                     n_neighbors=paella__n_neighbors,
>>                     power=paella__power,
>>                     random_state=paella__random_state)
>>
>> paella_output = paella_run.fit_predict(paella_input, y_train)
>> # paella_output is a vector with the sample_weight values
>>
>> #%%
>> """
>> Here we fit a regressor using sample_weight=paella_output
>> """
>> from sklearn.linear_model import LinearRegression
>>
>> regressor_input = X_train
>> lm = LinearRegression()
>> lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
>> regressor_output = lm.predict(X_train)
>>
>> # ...
>>
>> In this example we can see that:
>>
>> - A particular step might need results that were not necessarily produced
>> by the immediately previous step.
>> - The X parameter is not sequentially transformed. Sometimes we might
>> need to reach back to a previous step.
>> - y is sometimes the target and sometimes not. For the regressor it is
>> indeed, but for the Paella algorithm the prediction is expressed as a
>> vector representing sample_weights.
>>
>> All in all, the conclusion is that the chain of processes is not as
>> linear as imposed by the current API. I guess that all these difficulties
>> could be solved by:
>>
>> - Passing a dictionary through the different steps containing the partial
>> results that the following steps will need (see the sketch after this
>> message).
>> - As a Christmas gift :-) , a reference to the pipeline itself inserted
>> in that dictionary could provide access to the internal status of the
>> previous steps, should it be needed.
>>
>> Another interesting study case with similar needs would be a regressor
>> using a previous clustering step in order to fit one model per cluster.
>> In such a case, the clustering results would be needed during the fitting.
>>
>> Thanks for your interest!
>> Manolo
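To make the dictionary-passing idea above concrete, here is a minimal sketch of how a shared results dictionary could flow through the steps. The step functions and the convention of reading from and writing to `results` are assumptions for illustration, not an existing scikit-learn API; `data_train`, `X_train` and `y_train` are the names from the example above, and the binary weights are just a stand-in for the Paella sample_weight:

#%%
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LinearRegression

results = {'data': data_train, 'X': X_train, 'y': y_train}

def clustering_step(results):
    # Write the cluster labels into the shared dictionary.
    results['labels'] = DBSCAN(eps=0.06).fit_predict(results['data'])

def regression_step(results):
    # Reach back to results produced by earlier steps, not only the
    # immediately previous one; here a crude 0/1 weight stands in for
    # the sample_weight that Paella would compute.
    weights = (results['labels'] != -1).astype(float)
    lm = LinearRegression()
    lm.fit(results['X'], results['y'], sample_weight=weights)
    results['regressor'] = lm

for step in (clustering_step, regression_step):
    step(results)  # each step reads and writes the shared dictionary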