This is starting to look like the dask project. Do you know it?
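In case it helps, dask's core is exactly this kind of dict-described task
graph: keys map either to literal values or to (callable, *argument_keys)
tuples. A minimal sketch (inc, add and dsk are just illustrative names;
dask.get is dask's simple synchronous scheduler):

import dask

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# A dask graph is a plain dict; a tuple (callable, *keys) is a task whose
# arguments are resolved by looking the keys up in the same dict.
dsk = {'x': 1,
       'y': (inc, 'x'),       # y = inc(x) = 2
       'z': (add, 'x', 'y')}  # z = add(x, y) = 3

print(dask.get(dsk, 'z'))  # prints 3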
On Tue, Dec 26, 2017 at 05:49, Manuel Castejón Limas <manuel.caste...@gmail.com> wrote:

> I'm elaborating on the graph idea: a dictionary to describe the graph, the
> networkx package to support the graph and run it in topological order, and
> some wrappers for scikit-learn models.
>
> I'm currently thinking of putting some more effort into a contrib project.
>
> It could be something inspired by this example.
>
> Manolo
>
> #-------------------------------------------------
>
> import networkx as nx
>
> graph_description = {
>     'First':
>         {'operation': First_Step,
>          'input': {'X': X, 'y': y}},
>
>     'Concatenate_Xy':
>         {'operation': ConcatenateData_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y')]},
>
>     'Gaussian_Mixture':
>         {'operation': Gaussian_Mixture_Step,
>          'input': [('Concatenate_Xy', 'data')]},
>
>     'Dbscan':
>         {'operation': Dbscan_Step,
>          'input': [('Concatenate_Xy', 'data')]},
>
>     'CombineClustering':
>         {'operation': CombineClustering_Step,
>          'input': [('Dbscan', 'classification'),
>                    ('Gaussian_Mixture', 'classification')]},
>
>     'Paella':
>         {'operation': Paella_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y'),
>                    ('Concatenate_Xy', 'data'),
>                    ('CombineClustering', 'classification')]},
>
>     'Regressor':
>         {'operation': Regressor_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y'),
>                    ('Paella', 'sample_weight')]},
>
>     'Last':
>         {'operation': Last_Step,
>          'input': [('Regressor', 'regressor')]},
> }
>
> #%%
> def create_graph(description):
>     # Build a DiGraph whose nodes hold the step instance ('operation')
>     # and the references to the ascendants' results ('input').
>     cg = nx.DiGraph()
>     cg.add_nodes_from(description)
>     for current_name, info in description.items():
>         current_node = cg.nodes[current_name]
>         current_node['operation'] = info['operation'](graph=cg,
>                                                       node_name=current_name)
>         current_node['input'] = info['input']
>         if current_name != 'First':
>             for ascendant in set(name for name, attribute in info['input']):
>                 cg.add_edge(ascendant, current_name)
>     return cg
>
> #%%
> cg = create_graph(graph_description)
>
> node_pos = {'First':             ( 0, 0),
>             'Concatenate_Xy':    ( 2, 4),
>             'Gaussian_Mixture':  ( 6, 8),
>             'Dbscan':            ( 6, 6),
>             'CombineClustering': ( 8, 7),
>             'Paella':            (10, 2),
>             'Regressor':         (12, 0),
>             'Last':              (16, 0)}
>
> nx.draw(cg, pos=node_pos, with_labels=True)
>
> #%%
> print("=========================")
> for name in nx.topological_sort(cg):
>     print("Running: ", name)
>     cg.nodes[name]['operation'].fit()
> print("=========================")
>
> ########################
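>
> As a rough sketch of the "pipeline-like object" wrapping I have in mind,
> something like this could sit on top of create_graph. GraphPipeline is
> just a made-up name for illustration; it only assumes what the toy above
> assumes, namely that each node's 'operation' exposes a fit() method:
>
> class GraphPipeline:
>     """Illustrative wrapper that runs a step graph like a Pipeline."""
>     def __init__(self, graph_description):
>         self.graph_ = create_graph(graph_description)
>
>     def fit(self):
>         # Visit the steps in an order compatible with the edges, so each
>         # step finds the results of its ascendants already computed.
>         for name in nx.topological_sort(self.graph_):
>             self.graph_.nodes[name]['operation'].fit()
>         return self
>
> pipeline = GraphPipeline(graph_description)
> pipeline.fit()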
>
> 2017-12-22 12:09 GMT+01:00 Manuel Castejón Limas <manuel.caste...@gmail.com>:
>
>> I'm currently thinking of a computational graph which can then be wrapped
>> in a pipeline-like object ... I'll try to make a toy example solving my
>> problem.
>>
>> On 20 Dec 2017 16:33, "Manuel Castejón Limas" <manuel.caste...@gmail.com> wrote:
>>
>>> Thank you all for your interest!
>>>
>>> In order to clarify the case, allow me to try to synthesize the spirit
>>> of what I'd like to put into the pipeline using this sequence of steps:
>>>
>>> #%%
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>>
>>> from sklearn.cluster import DBSCAN
>>> from sklearn.mixture import GaussianMixture
>>> from sklearn.model_selection import train_test_split
>>>
>>> np.random.seed(seed=42)
>>>
>>> """
>>> Data preparation
>>> """
>>>
>>> URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv"
>>> data = pd.read_csv(URL, usecols=['V1', 'V2'])
>>> X, y = data[['V1']], data[['V2']]
>>>
>>> (data_train, data_test,
>>>  X_train, X_test,
>>>  y_train, y_test) = train_test_split(data, X, y)
>>>
>>> """
>>> Parameters setup
>>> """
>>>
>>> dbscan__eps = 0.06
>>>
>>> mclust__n_components = 3
>>>
>>> paella__noise_label = -1
>>> paella__max_it = 20
>>> paella__regular_size = 400
>>> paella__minimum_size = 100
>>> paella__width_r = 0.99
>>> paella__n_neighbors = 5
>>> paella__power = 30
>>> paella__random_state = None
>>>
>>> #%%
>>> """
>>> DBSCAN clustering to detect noise suspects (label == -1)
>>> """
>>>
>>> dbscan_input = data_train
>>>
>>> dbscan_clustering = DBSCAN(eps=dbscan__eps)
>>>
>>> dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
>>>
>>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>>>             c=np.int64(dbscan_output == -1))
>>>
>>> #%%
>>> """
>>> GaussianMixture is fitted on data_train with the noise suspects filtered
>>> out in order to help locate the ellipsoids, but predict is applied to
>>> the whole data_train set.
>>> """
>>>
>>> mclust_input = data_train[dbscan_output != -1]
>>>
>>> mclust_clustering = GaussianMixture(n_components=mclust__n_components)
>>> mclust_clustering.fit(mclust_input)
>>>
>>> mclust_output = mclust_clustering.predict(data_train)
>>>
>>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>>>             c=mclust_output)
>>>
>>> #%%
>>> """
>>> The mclust and dbscan results are combined.
>>> """
>>>
>>> clustering_output = mclust_output.copy()
>>> clustering_output[dbscan_output == -1] = -1
>>>
>>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>>>             c=clustering_output)
>>>
>>> #%%
>>> """
>>> Good old Paella paper:
>>> https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c
>>>
>>> The Paella algorithm calculates the sample_weight to be used by the
>>> final step regressor (yes, it is an outlier detection algorithm, but we
>>> are focusing now on this interesting collateral result).
>>> I am currently aggressively changing the code in order to make it fit
>>> somehow with the pipeline.
>>> """
>>>
>>> from paella import Paella
>>>
>>> paella_input = pd.concat([data_train,
>>>                           pd.Series(clustering_output,
>>>                                     index=data_train.index,
>>>                                     name='classification')],
>>>                          axis=1)
>>>
>>> paella_run = Paella(noise_label=paella__noise_label,
>>>                     max_it=paella__max_it,
>>>                     regular_size=paella__regular_size,
>>>                     minimum_size=paella__minimum_size,
>>>                     width_r=paella__width_r,
>>>                     n_neighbors=paella__n_neighbors,
>>>                     power=paella__power,
>>>                     random_state=paella__random_state)
>>>
>>> paella_output = paella_run.fit_predict(paella_input, y_train)
>>> # paella_output is a vector with sample_weight
>>>
>>> #%%
>>> """
>>> Here we fit a regressor using sample_weight=paella_output
>>> """
>>> from sklearn.linear_model import LinearRegression
>>>
>>> regressor_input = X_train
>>> lm = LinearRegression()
>>> lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
>>> regressor_output = lm.predict(X_train)
>>>
>>> # ...
>>>
>>> In this example we can see that:
>>> - A particular step might need results produced by steps other than the
>>>   immediately preceding one.
>>> - The X parameter is not sequentially transformed; sometimes we need to
>>>   reach back to the output of an earlier step.
>>> - y is sometimes the target and sometimes not: for the regressor it is
>>>   indeed, but for the Paella algorithm the prediction is a vector
>>>   representing sample weights.
>>>
>>> All in all, the conclusion is that the chain of processes is not as
>>> linear as imposed by the current API. I guess that all these
>>> difficulties could be solved by:
>>> - Passing a dictionary through the different steps containing the
>>>   partial results that the following steps will need.
>>> - As a Christmas gift :-), a reference to the pipeline itself inserted
>>>   in that dictionary could provide access to the internal state of the
>>>   previous steps, should it be needed.
>>>
>>> Another interesting study case with similar needs would be a regressor
>>> using a previous clustering step in order to fit one model per cluster.
>>> In such a case, the clustering results would be needed during fitting;
>>> a small sketch of the idea follows.
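>>>
>>> As a minimal sketch (ClusteredRegressor is a made-up name, and KMeans
>>> stands in for the clustering step only because, unlike DBSCAN, it can
>>> assign clusters to new samples through predict):
>>>
>>> from sklearn.base import clone
>>> from sklearn.cluster import KMeans
>>>
>>> class ClusteredRegressor:
>>>     """Illustrative sketch: one regressor per cluster."""
>>>     def __init__(self, clusterer, regressor):
>>>         self.clusterer = clusterer
>>>         self.regressor = regressor
>>>
>>>     def fit(self, X, y):
>>>         # The clustering results are needed during fitting: each
>>>         # cluster gets its own clone of the base regressor.
>>>         labels = self.clusterer.fit_predict(X)
>>>         self.models_ = {
>>>             label: clone(self.regressor).fit(X[labels == label],
>>>                                              y[labels == label])
>>>             for label in np.unique(labels)}
>>>         return self
>>>
>>>     def predict(self, X):
>>>         labels = self.clusterer.predict(X)
>>>         y_pred = np.empty(len(X))
>>>         for label, model in self.models_.items():
>>>             mask = labels == label
>>>             if mask.any():
>>>                 y_pred[mask] = model.predict(X[mask]).ravel()
>>>         return y_pred
>>>
>>> clustered_lm = ClusteredRegressor(KMeans(n_clusters=3), LinearRegression())
>>> clustered_lm.fit(X_train, y_train)
>>> y_pred = clustered_lm.predict(X_test)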
>>>
>>> Thanks for your interest!
>>> Manolo

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn