This is starting to look like the dask project. Do you know it?
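In case it helps, dask's core is exactly this kind of dict-described task
graph: keys map either to literal values or to (callable, *argument_keys)
tuples. A minimal sketch (inc, add and dsk are just illustrative names;
dask.get is dask's simple synchronous scheduler):

import dask

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# A dask graph is a plain dict; a tuple (callable, *keys) is a task whose
# arguments are resolved by looking the keys up in the same dict.
dsk = {'x': 1,
       'y': (inc, 'x'),       # y = inc(x) = 2
       'z': (add, 'x', 'y')}  # z = add(x, y) = 3

print(dask.get(dsk, 'z'))  # prints 3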
On Tue, Dec 26, 2017 at 05:49, Manuel Castejón Limas <manuel.caste...@gmail.com> wrote:

> I'm elaborating on the graph idea: a dictionary to describe the graph, the
> networkx package to support the graph and run it in topological order, and
> some wrappers for scikit-learn models.
>
> I'm currently thinking of putting some more effort into a contrib project.
>
> It could be something inspired by this example.
>
> Manolo
>
> #-------------------------------------------------
>
> import networkx as nx
>
> graph_description = {
>     'First':
>         {'operation': First_Step,
>          'input': {'X': X, 'y': y}},
>
>     'Concatenate_Xy':
>         {'operation': ConcatenateData_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y')]},
>
>     'Gaussian_Mixture':
>         {'operation': Gaussian_Mixture_Step,
>          'input': [('Concatenate_Xy', 'data')]},
>
>     'Dbscan':
>         {'operation': Dbscan_Step,
>          'input': [('Concatenate_Xy', 'data')]},
>
>     'CombineClustering':
>         {'operation': CombineClustering_Step,
>          'input': [('Dbscan', 'classification'),
>                    ('Gaussian_Mixture', 'classification')]},
>
>     'Paella':
>         {'operation': Paella_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y'),
>                    ('Concatenate_Xy', 'data'),
>                    ('CombineClustering', 'classification')]},
>
>     'Regressor':
>         {'operation': Regressor_Step,
>          'input': [('First', 'X'),
>                    ('First', 'y'),
>                    ('Paella', 'sample_weight')]},
>
>     'Last':
>         {'operation': Last_Step,
>          'input': [('Regressor', 'regressor')]},
> }
>
> #%%
> def create_graph(description):
>     # Build a DiGraph whose nodes hold the step instance ('operation')
>     # and the references to the ascendants' results ('input').
>     cg = nx.DiGraph()
>     cg.add_nodes_from(description)
>     for current_name, info in description.items():
>         current_node = cg.nodes[current_name]
>         current_node['operation'] = info['operation'](graph=cg,
>                                                       node_name=current_name)
>         current_node['input'] = info['input']
>         if current_name != 'First':
>             for ascendant in set(name for name, attribute in info['input']):
>                 cg.add_edge(ascendant, current_name)
>     return cg
>
> #%%
> cg = create_graph(graph_description)
>
> node_pos = {'First':             ( 0, 0),
>             'Concatenate_Xy':    ( 2, 4),
>             'Gaussian_Mixture':  ( 6, 8),
>             'Dbscan':            ( 6, 6),
>             'CombineClustering': ( 8, 7),
>             'Paella':            (10, 2),
>             'Regressor':         (12, 0),
>             'Last':              (16, 0)}
>
> nx.draw(cg, pos=node_pos, with_labels=True)
>
> #%%
> print("=========================")
> for name in nx.topological_sort(cg):
>     print("Running: ", name)
>     cg.nodes[name]['operation'].fit()
> print("=========================")
>
> ########################
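>
> As a rough sketch of the "pipeline-like object" wrapping I have in mind,
> something like this could sit on top of create_graph. GraphPipeline is
> just a made-up name for illustration; it only assumes what the toy above
> assumes, namely that each node's 'operation' exposes a fit() method:
>
> class GraphPipeline:
>     """Illustrative wrapper that runs a step graph like a Pipeline."""
>     def __init__(self, graph_description):
>         self.graph_ = create_graph(graph_description)
>
>     def fit(self):
>         # Visit the steps in an order compatible with the edges, so each
>         # step finds the results of its ascendants already computed.
>         for name in nx.topological_sort(self.graph_):
>             self.graph_.nodes[name]['operation'].fit()
>         return self
>
> pipeline = GraphPipeline(graph_description)
> pipeline.fit()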
>
> 2017-12-22 12:09 GMT+01:00 Manuel Castejón Limas <manuel.caste...@gmail.com>:
>
>> I'm currently thinking of a computational graph which can then be wrapped
>> in a pipeline-like object ... I'll try to make a toy example solving my
>> problem.
>>
>> On 20 Dec 2017 16:33, "Manuel Castejón Limas" <manuel.caste...@gmail.com> wrote:
>>
>>> Thank you all for your interest!
>>>
>>> In order to clarify the case, allow me to try to synthesize the spirit
>>> of what I'd like to put into the pipeline using this sequence of steps:
>>>
>>> #%%
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>>
>>> from sklearn.cluster import DBSCAN
>>> from sklearn.mixture import GaussianMixture
>>> from sklearn.model_selection import train_test_split
>>>
>>> np.random.seed(seed=42)
>>>
>>> """
>>> Data preparation
>>> """
>>>
>>> URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv"
>>> data = pd.read_csv(URL, usecols=['V1', 'V2'])
>>> X, y = data[['V1']], data[['V2']]
>>>
>>> (data_train, data_test,
>>>  X_train, X_test,
>>>  y_train, y_test) = train_test_split(data, X, y)
>>>
>>> """
>>> Parameters setup
>>> """
>>>
>>> dbscan__eps = 0.06
>>>
>>> mclust__n_components = 3
>>>
>>> paella__noise_label = -1
>>> paella__max_it = 20
>>> paella__regular_size = 400
>>> paella__minimum_size = 100
>>> paella__width_r = 0.99
>>> paella__n_neighbors = 5
>>> paella__power = 30
>>> paella__random_state = None
>>>
>>> #%%
>>> """
>>> DBSCAN clustering to detect noise suspects (label == -1)
>>> """
>>>
>>> dbscan_input = data_train
>>>
>>> dbscan_clustering = DBSCAN(eps=dbscan__eps)
>>>
>>> dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
>>>
>>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>>>             c=np.int64(dbscan_output == -1))
>>>
>>> #%%
>>> """
>>> GaussianMixture is fitted on data_train with the noise suspects filtered
>>> out in order to help locate the ellipsoids, but predict is applied to
>>> the whole data_train set.
>>> """
>>>
>>> mclust_input = data_train[dbscan_output != -1]
>>>
>>> mclust_clustering = GaussianMixture(n_components=mclust__n_components)
>>> mclust_clustering.fit(mclust_input)
>>>
>>> mclust_output = mclust_clustering.predict(data_train)
>>>
>>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>>>             c=mclust_output)
>>>
>>> #%%
>>> """
>>> The mclust and dbscan results are combined.
>>> """
>>>
>>> clustering_output = mclust_output.copy()
>>> clustering_output[dbscan_output == -1] = -1
>>>
>>> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
>>>             c=clustering_output)
>>>
>>> #%%
>>> """
>>> Good old Paella paper:
>>> https://link.springer.com/article/10.1023/B:DAMI.0000031630.50685.7c
>>>
>>> The Paella algorithm calculates the sample_weight to be used by the
>>> final step regressor (yes, it is an outlier detection algorithm, but we
>>> are focusing now on this interesting collateral result).
>>> I am currently aggressively changing the code in order to make it fit
>>> somehow with the pipeline.
>>> """
>>>
>>> from paella import Paella
>>>
>>> paella_input = pd.concat([data_train,
>>>                           pd.Series(clustering_output,
>>>                                     index=data_train.index,
>>>                                     name='classification')],
>>>                          axis=1)
>>>
>>> paella_run = Paella(noise_label=paella__noise_label,
>>>                     max_it=paella__max_it,
>>>                     regular_size=paella__regular_size,
>>>                     minimum_size=paella__minimum_size,
>>>                     width_r=paella__width_r,
>>>                     n_neighbors=paella__n_neighbors,
>>>                     power=paella__power,
>>>                     random_state=paella__random_state)
>>>
>>> paella_output = paella_run.fit_predict(paella_input, y_train)
>>> # paella_output is a vector with sample_weight
>>>
>>> #%%
>>> """
>>> Here we fit a regressor using sample_weight=paella_output
>>> """
>>> from sklearn.linear_model import LinearRegression
>>>
>>> regressor_input = X_train
>>> lm = LinearRegression()
>>> lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
>>> regressor_output = lm.predict(X_train)
>>>
>>> # ...
>>>
>>> In this example we can see that:
>>> - A particular step might need results produced by steps other than the
>>>   immediately preceding one.
>>> - The X parameter is not sequentially transformed; sometimes we need to
>>>   reach back to the output of an earlier step.
>>> - y is sometimes the target and sometimes not: for the regressor it is
>>>   indeed, but for the Paella algorithm the prediction is a vector
>>>   representing sample weights.
>>>
>>> All in all, the conclusion is that the chain of processes is not as
>>> linear as imposed by the current API. I guess that all these
>>> difficulties could be solved by:
>>> - Passing a dictionary through the different steps containing the
>>>   partial results that the following steps will need.
>>> - As a Christmas gift :-), a reference to the pipeline itself inserted
>>>   in that dictionary could provide access to the internal state of the
>>>   previous steps, should it be needed.
>>>
>>> Another interesting study case with similar needs would be a regressor
>>> using a previous clustering step in order to fit one model per cluster.
>>> In such a case, the clustering results would be needed during fitting;
>>> a small sketch of the idea follows.
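>>>
>>> As a minimal sketch (ClusteredRegressor is a made-up name, and KMeans
>>> stands in for the clustering step only because, unlike DBSCAN, it can
>>> assign clusters to new samples through predict):
>>>
>>> from sklearn.base import clone
>>> from sklearn.cluster import KMeans
>>>
>>> class ClusteredRegressor:
>>>     """Illustrative sketch: one regressor per cluster."""
>>>     def __init__(self, clusterer, regressor):
>>>         self.clusterer = clusterer
>>>         self.regressor = regressor
>>>
>>>     def fit(self, X, y):
>>>         # The clustering results are needed during fitting: each
>>>         # cluster gets its own clone of the base regressor.
>>>         labels = self.clusterer.fit_predict(X)
>>>         self.models_ = {
>>>             label: clone(self.regressor).fit(X[labels == label],
>>>                                              y[labels == label])
>>>             for label in np.unique(labels)}
>>>         return self
>>>
>>>     def predict(self, X):
>>>         labels = self.clusterer.predict(X)
>>>         y_pred = np.empty(len(X))
>>>         for label, model in self.models_.items():
>>>             mask = labels == label
>>>             if mask.any():
>>>                 y_pred[mask] = model.predict(X[mask]).ravel()
>>>         return y_pred
>>>
>>> clustered_lm = ClusteredRegressor(KMeans(n_clusters=3), LinearRegression())
>>> clustered_lm.fit(X_train, y_train)
>>> y_pred = clustered_lm.predict(X_test)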
>>>
>>> Thanks for your interest!
>>> Manolo

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn