Hello community,

here is the log from the commit of package python-sklearn-pandas for openSUSE:Factory checked in at 2020-10-25 18:06:34
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-sklearn-pandas (Old)
 and      /work/SRC/openSUSE:Factory/.python-sklearn-pandas.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-sklearn-pandas"

Sun Oct 25 18:06:34 2020 rev:5 rq:841146 version:2.0.2

Changes:
--------
--- /work/SRC/openSUSE:Factory/python-sklearn-pandas/python-sklearn-pandas.changes 2020-01-07 23:52:07.083992834 +0100
+++ /work/SRC/openSUSE:Factory/.python-sklearn-pandas.new.3463/python-sklearn-pandas.changes 2020-10-25 18:06:44.367352067 +0100
@@ -1,0 +2,30 @@
+Sat Oct 10 19:08:06 UTC 2020 - Arun Persaud <a...@gmx.de>
+
+- specfile:
+  * updated versions of required packages
+
+- update to version 2.0.2:
+  * Fix DataFrameMapper drop_cols attribute naming consistency with
+    scikit-learn and initialization.
+
+- changes from version 2.0.1:
+  * Added an option to explicitly drop columns.
+
+- changes from version 2.0.0:
+  * Deprecated support for Python < 3.6.
+  * Deprecated support for old versions of scikit-learn, pandas and
+    numpy. Please check setup.py for minimum requirement.
+  * Removed CategoricalImputer, cross_val_score and GridSearchCV. All
+    these functionality now exists as part of scikit-learn. Please use
+    SimpleImputer instead of CategoricalImputer. Also Cross validation
+    from sklearn now supports dataframe so we don't need to use cross
+    validation wrapper provided over here.
+  * Added NumericalTransformer for common numerical
+    transformations. Currently it implements log and log1p
+    transformation.
+  * Added prefix and suffix options. See examples above. These are
+    usually helpful when using gen_features.
+  * Added drop_cols argument to DataframeMapper. This can be used to
+    explicitly drop columns
+
+-------------------------------------------------------------------

Old:
----
  sklearn-pandas-1.8.0.tar.gz

New:
----
  sklearn-pandas-2.0.2.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ python-sklearn-pandas.spec ++++++
--- /var/tmp/diff_new_pack.Xk17P1/_old 2020-10-25 18:06:45.675353305 +0100
+++ /var/tmp/diff_new_pack.Xk17P1/_new 2020-10-25 18:06:45.675353305 +0100
@@ -1,7 +1,7 @@
 #
 # spec file for package python-sklearn-pandas
 #
-# Copyright (c) 2020 SUSE LINUX GmbH, Nuernberg, Germany.
+# Copyright (c) 2020 SUSE LLC
 #
 # All modifications and additions to the file contributed by third parties
 # remain the property of their copyright owners, unless otherwise agreed
@@ -19,7 +19,7 @@
 %{?!python_module:%define python_module() python-%{**} python3-%{**}}
 %define skip_python2 1
 Name: python-sklearn-pandas
-Version: 1.8.0
+Version: 2.0.2
 Release: 0
 Summary: Pandas integration with sklearn
 License: Zlib AND BSD-2-Clause
@@ -29,18 +29,18 @@
 BuildRequires: %{python_module setuptools}
 BuildRequires: fdupes
 BuildRequires: python-rpm-macros
-Requires: python-numpy >= 1.6.1
-Requires: python-pandas >= 0.11.0
-Requires: python-scikit-learn >= 0.15.0
-Requires: python-scipy >= 0.14
+Requires: python-numpy >= 1.18.1
+Requires: python-pandas >= 1.0.5
+Requires: python-scikit-learn >= 0.23.0
+Requires: python-scipy >= 1.4.1
 BuildArch: noarch
 # SECTION test requirements
 BuildRequires: %{python_module mock}
-BuildRequires: %{python_module numpy >= 1.6.1}
-BuildRequires: %{python_module pandas >= 0.11.0}
+BuildRequires: %{python_module numpy >= 1.18.1}
+BuildRequires: %{python_module pandas >= 1.0.5}
 BuildRequires: %{python_module pytest}
-BuildRequires: %{python_module scikit-learn >= 0.15.0}
-BuildRequires: %{python_module scipy >= 0.14}
+BuildRequires: %{python_module scikit-learn >= 0.23.0}
+BuildRequires: %{python_module scipy >= 1.4.1}
 # /SECTION
 %python_subpackages

++++++ sklearn-pandas-1.8.0.tar.gz -> sklearn-pandas-2.0.2.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/PKG-INFO new/sklearn-pandas-2.0.2/PKG-INFO
--- old/sklearn-pandas-1.8.0/PKG-INFO 2018-12-01 20:14:57.000000000 +0100
+++ new/sklearn-pandas-2.0.2/PKG-INFO 2020-10-01 22:54:52.000000000 +0200
@@ -1,12 +1,11 @@
-Metadata-Version: 1.0
+Metadata-Version: 1.2
 Name: sklearn-pandas
-Version: 1.8.0
+Version: 2.0.2
 Summary: Pandas integration with sklearn
-Home-page: https://github.com/paulgb/sklearn-pandas
-Author: Israel Saeta Pérez
-Author-email: israel.sa...@dukebody.com
+Home-page: https://github.com/scikit-learn-contrib/sklearn-pandas
+Maintainer: Ritesh Agrawal
+Maintainer-email: ragra...@gmail.com
 License: UNKNOWN
-Description-Content-Type: UNKNOWN
 Description: UNKNOWN
 Keywords: scikit,sklearn,pandas
 Platform: UNKNOWN
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/README.rst new/sklearn-pandas-2.0.2/README.rst
--- old/sklearn-pandas-1.8.0/README.rst 2018-12-01 20:13:37.000000000 +0100
+++ new/sklearn-pandas-2.0.2/README.rst 2020-10-01 22:35:05.000000000 +0200
@@ -2,16 +2,11 @@
 Sklearn-pandas
 ==============
-.. image:: https://circleci.com/gh/pandas-dev/sklearn-pandas.svg?style=svg
-    :target: https://circleci.com/gh/pandas-dev/sklearn-pandas
+.. image:: https://circleci.com/gh/scikit-learn-contrib/sklearn-pandas.svg?style=svg
+    :target: https://circleci.com/gh/scikit-learn-contrib/sklearn-pandas
 This module provides a bridge between `Scikit-Learn <http://scikit-learn.org/stable>`__'s machine learning methods and `pandas <https://pandas.pydata.org>`__-style Data Frames.
-
-In particular, it provides:
-
-1. A way to map ``DataFrame`` columns to transformations, which are later recombined into features.
-2. A compatibility shim for old ``scikit-learn`` versions to cross-validate a pipeline that takes a pandas ``DataFrame`` as input. This is only needed for ``scikit-learn<0.16.0`` (see `#11 <https://github.com/paulgb/sklearn-pandas/issues/11>`__ for details). It is deprecated and will likely be dropped in ``skearn-pandas==2.0``.
-3. A couple of special transformers that work well with pandas inputs: ``CategoricalImputer`` and ``FunctionTransformer`.`
+In particular, it provides a way to map ``DataFrame`` columns to transformations, which are later recombined into features.
 Installation
 ------------
@@ -20,6 +15,7 @@
     # pip install sklearn-pandas
+
 Tests
 -----
@@ -36,11 +32,11 @@
 Import what you need from the ``sklearn_pandas`` package. The choices are:
 * ``DataFrameMapper``, a class for mapping pandas data frame columns to different sklearn transformations
-* ``cross_val_score``, similar to ``sklearn.cross_validation.cross_val_score`` but working on pandas DataFrames
+
 For this demonstration, we will import both::
-    >>> from sklearn_pandas import DataFrameMapper, cross_val_score
+    >>> from sklearn_pandas import DataFrameMapper
 For these examples, we'll also use pandas, numpy, and sklearn::
@@ -136,6 +132,16 @@
     >>> mapper_alias.transformed_names_
     ['children_scaled']
+Alternatively, you can also specify prefix and/or suffix to add to the column name. For example::
+
+
+    >>> mapper_alias = DataFrameMapper([
+    ...     (['children'], sklearn.preprocessing.StandardScaler(), {'prefix': 'standard_scaled_'}),
+    ...     (['children'], sklearn.preprocessing.StandardScaler(), {'suffix': '_raw'})
+    ... ])
+    >>> _ = mapper_alias.fit_transform(data.copy())
+    >>> mapper_alias.transformed_names_
+    ['standard_scaled_children', 'children_raw']
 Passing Series/DataFrames to the transformers
 *********************************************
@@ -204,6 +210,32 @@
 Note this does not work together with the ``default=True`` or ``sparse=True`` arguments to the mapper.
+Dropping columns explictly
+*******************************
+
+Sometimes it is required to drop a specific column/ list of columns.
+For this purpose, ``drop_cols`` argument for ``DataFrameMapper`` can be used.
+Default value is ``None``
+
+    >>> mapper_df = DataFrameMapper([
+    ...     ('pet', sklearn.preprocessing.LabelBinarizer()),
+    ...     (['children'], sklearn.preprocessing.StandardScaler())
+    ... ], drop_cols=['salary'])
+
+Now running ``fit_transform`` will run transformations on 'pet' and 'children' and drop 'salary' column:
+
+    >>> np.round(mapper_df.fit_transform(data.copy()), 1)
+    array([[ 1. , 0. , 0. , 0.2],
+           [ 0. , 1. , 0. , 1.9],
+           [ 0. , 1. , 0. , -0.6],
+           [ 0. , 0. , 1. , -0.6],
+           [ 1. , 0. , 0. , -1.5],
+           [ 0. , 1. , 0. , -0.6],
+           [ 1. , 0. , 0. , 1. ],
+           [ 0. , 0. , 1. , 0.2]])
+
+Transformations may require multiple input columns. In these
+
 Transform Multiple Columns
 **************************
@@ -231,8 +263,9 @@
 Multiple transformers can be applied to the same column specifying them in a list::
+    >>> from sklearn.impute import SimpleImputer
     >>> mapper3 = DataFrameMapper([
-    ...     (['age'], [sklearn.preprocessing.Imputer(),
+    ...     (['age'], [SimpleImputer(),
     ...                sklearn.preprocessing.StandardScaler()])])
     >>> data_3 = pd.DataFrame({'age': [1, np.nan, 3]})
     >>> mapper3.fit_transform(data_3)
@@ -302,7 +335,7 @@
     ...     classes=[sklearn.preprocessing.LabelEncoder]
     ... )
     >>> feature_def
-    [('col1', [LabelEncoder()]), ('col2', [LabelEncoder()]), ('col3', [LabelEncoder()])]
+    [('col1', [LabelEncoder()], {}), ('col2', [LabelEncoder()], {}), ('col3', [LabelEncoder()], {})]
     >>> mapper5 = DataFrameMapper(feature_def)
     >>> data5 = pd.DataFrame({
     ...     'col1': ['yes', 'no', 'yes'],
@@ -318,23 +351,42 @@
 transformer parameters should be provided. For example, consider a dataset with missing values.
 Then the following code could be used to override default imputing strategy:
+    >>> from sklearn.impute import SimpleImputer
+    >>> import numpy as np
     >>> feature_def = gen_features(
     ...     columns=[['col1'], ['col2'], ['col3']],
-    ...     classes=[{'class': sklearn.preprocessing.Imputer, 'strategy': 'most_frequent'}]
+    ...     classes=[{'class': SimpleImputer, 'strategy':'most_frequent'}]
     ... )
     >>> mapper6 = DataFrameMapper(feature_def)
     >>> data6 = pd.DataFrame({
-    ...     'col1': [None, 1, 1, 2, 3],
-    ...     'col2': [True, False, None, None, True],
-    ...     'col3': [0, 0, 0, None, None]
+    ...     'col1': [np.nan, 1, 1, 2, 3],
+    ...     'col2': [True, False, np.nan, np.nan, True],
+    ...     'col3': [0, 0, 0, np.nan, np.nan]
     ... })
     >>> mapper6.fit_transform(data6)
-    array([[1., 1., 0.],
-           [1., 0., 0.],
-           [1., 1., 0.],
-           [2., 1., 0.],
-           [3., 1., 0.]])
+    array([[1.0, True, 0.0],
+           [1.0, False, 0.0],
+           [1.0, True, 0.0],
+           [2.0, True, 0.0],
+           [3.0, True, 0.0]], dtype=object)
+You can also specify global prefix or suffix for the generated transformed column names using the prefix and suffix
+parameters::
+
+    >>> feature_def = gen_features(
+    ...     columns=['col1', 'col2', 'col3'],
+    ...     classes=[sklearn.preprocessing.LabelEncoder],
+    ...     prefix="lblencoder_"
+    ... )
+    >>> mapper5 = DataFrameMapper(feature_def)
+    >>> data5 = pd.DataFrame({
+    ...     'col1': ['yes', 'no', 'yes'],
+    ...     'col2': [True, False, False],
+    ...     'col3': ['one', 'two', 'three']
+    ... })
+    >>> _ = mapper5.fit_transform(data5)
+    >>> mapper5.transformed_names_
+    ['lblencoder_col1', 'lblencoder_col2', 'lblencoder_col3']
 Feature selection and other supervised transformations
 ******************************************************
@@ -356,7 +408,8 @@
 Working with sparse features
 ****************************
-A ``DataFrameMapper`` will return a dense feature array by default. Setting ``sparse=True`` in the mapper will return a sparse array whenever any of the extracted features is sparse. Example:
+A ``DataFrameMapper`` will return a dense feature array by default. Setting ``sparse=True`` in the mapper will return
+a sparse array whenever any of the extracted features is sparse. Example:
     >>> mapper5 = DataFrameMapper([
     ...     ('pet', CountVectorizer()),
@@ -366,87 +419,89 @@
 The stacking of the sparse features is done without ever densifying them.
-Cross-Validation
-****************
-Now that we can combine features from pandas DataFrames, we may want to use cross-validation to see whether our model works. ``scikit-learn<0.16.0`` provided features for cross-validation, but they expect numpy data structures and won't work with ``DataFrameMapper``.
+Using ``NumericalTransformer``
+***********************************
-To get around this, sklearn-pandas provides a wrapper on sklearn's ``cross_val_score`` function which passes a pandas DataFrame to the estimator rather than a numpy array::
+While you can use ``FunctionTransformation`` to generate arbitrary transformers, it can present serialization issues
+when pickling. Use ``NumericalTransformer`` instead, which takes the function name as a string parameter and hence
+can be easily serialized.
-    >>> pipe = sklearn.pipeline.Pipeline([
-    ...     ('featurize', mapper),
-    ...     ('lm', sklearn.linear_model.LinearRegression())])
-    >>> np.round(cross_val_score(pipe, X=data.copy(), y=data.salary, scoring='r2'), 2)
-    array([ -1.09, -5.3 , -15.38])
-
-Sklearn-pandas' ``cross_val_score`` function provides exactly the same interface as sklearn's function of the same name.
-
-``CategoricalImputer``
-**********************
-
-Since the ``scikit-learn`` ``Imputer`` transformer currently only works with
-numbers, ``sklearn-pandas`` provides an equivalent helper transformer that
-works with strings, substituting null values with the most frequent value in
-that column. Alternatively, you can specify a fixed value to use.
+    >>> from sklearn_pandas import NumericalTransformer
+    >>> mapper5 = DataFrameMapper([
+    ...     ('children', NumericalTransformer('log')),
+    ... ])
+    >>> mapper5.fit_transform(data)
+    array([[1.38629436],
+           [1.79175947],
+           [1.09861229],
+           [1.09861229],
+           [0.69314718],
+           [1.09861229],
+           [1.60943791],
+           [1.38629436]])
-Example: imputing with the mode:
-    >>> from sklearn_pandas import CategoricalImputer
-    >>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
-    >>> imputer = CategoricalImputer()
-    >>> imputer.fit_transform(data)
-    array(['a', 'b', 'b', 'b'], dtype=object)
-Example: imputing with a fixed value:
+Changelog
+---------
+2.0.2 (2020-10-01)
+******************
-    >>> from sklearn_pandas import CategoricalImputer
-    >>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
-    >>> imputer = CategoricalImputer(strategy='constant', fill_value='a')
-    >>> imputer.fit_transform(data)
-    array(['a', 'b', 'b', 'a'], dtype=object)
+* Fix `DataFrameMapper` drop_cols attribute naming consistency with scikit-learn and initialization.
-``FunctionTransformer``
-***********************
+2.0.1 (2020-09-07)
+******************
-Often one wants to apply simple transformations to data such as ``np.log``. ``FunctionTransformer`` is a simple wrapper that takes any function and applies vectorization so that it can be used as a transformer.
+* Added an option to explicitly drop columns.
-Example:
-    >>> from sklearn_pandas import FunctionTransformer
-    >>> array = np.array([10, 100])
-    >>> transformer = FunctionTransformer(np.log10)
+2.0.0 (2020-08-01)
+******************
-    >>> transformer.fit_transform(array)
-    array([1., 2.])
+* Deprecated support for Python < 3.6.
+* Deprecated support for old versions of scikit-learn, pandas and numpy. Please check setup.py for minimum requirement.
+* Removed CategoricalImputer, cross_val_score and GridSearchCV. All these functionality now exists as part of
+  scikit-learn. Please use SimpleImputer instead of CategoricalImputer. Also
+  Cross validation from sklearn now supports dataframe so we don't need to use cross validation wrapper provided over
+  here.
+* Added ``NumericalTransformer`` for common numerical transformations. Currently it implements log and log1p
+  transformation.
+* Added prefix and suffix options. See examples above. These are usually helpful when using gen_features.
+* Added ``drop_cols`` argument to DataframeMapper. This can be used to explicitly drop columns
-Changelog
----------
 1.8.0 (2018-12-01)
 ******************
+
 * Add ``FunctionTransformer`` class (#117).
 * Fix column names derivation for dataframes with multi-index or non-string columns (#166).
 * Change behaviour of DataFrameMapper's fit_transform method to invoke each underlying transformers' native fit_transform if implemented. (#150)
+
 1.7.0 (2018-08-15)
 ******************
+
 * Fix issues with unicode names in ``get_names`` (#160).
 * Update to build using ``numpy==1.14`` and ``python==3.6`` (#154).
 * Add ``strategy`` and ``fill_value`` parameters to ``CategoricalImputer`` to allow imputing with values other than the mode (#144), (#161).
 * Preserve input data types when no transform is supplied (#138).
+
 1.6.0 (2017-10-28)
 ******************
+
 * Add column name to exception during fit/transform (#110).
 * Add ``gen_feature`` helper function to help generating the same transformation for multiple columns (#126).
 1.5.0 (2017-06-24)
 ******************
+
 * Allow inputting a dataframe/series per group of columns.
 * Get feature names also from ``estimator.get_feature_names()`` if present.
 * Attempt to derive feature names from individual transformers when applying a
@@ -457,6 +512,7 @@
 1.4.0 (2017-05-13)
 ******************
+
 * Allow specifying a custom name (alias) for transformed columns (#83).
 * Capture output columns generated names in ``transformed_names_`` attribute (#78).
 * Add ``CategoricalImputer`` that replaces null-like values with the mode
@@ -534,3 +590,5 @@
 * Timothy Sweetser (@hacktuarial)
 * Vitaley Zaretskey (@vzaretsk)
 * Zac Stewart (@zacstewart)
+* Parul Singh (@paro1234)
+* Vincent Heusinkveld (@VHeusinkveld)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/setup.py new/sklearn-pandas-2.0.2/setup.py
--- old/sklearn-pandas-1.8.0/setup.py 2016-04-03 13:14:44.000000000 +0200
+++ new/sklearn-pandas-2.0.2/setup.py 2020-09-07 03:30:35.000000000 +0200
@@ -32,16 +32,17 @@
 setup(name='sklearn-pandas',
       version=__version__,
       description='Pandas integration with sklearn',
-      maintainer='Israel Saeta Pérez',
-      maintainer_email='israel.sa...@dukebody.com',
-      url='https://github.com/paulgb/sklearn-pandas',
+      maintainer='Ritesh Agrawal',
+      maintainer_email='ragra...@gmail.com',
+      url='https://github.com/scikit-learn-contrib/sklearn-pandas',
       packages=['sklearn_pandas'],
       keywords=['scikit', 'sklearn', 'pandas'],
       install_requires=[
-          'scikit-learn>=0.15.0',
-          'scipy>=0.14',
-          'pandas>=0.11.0',
-          'numpy>=1.6.1'],
+          'scikit-learn>=0.23.0',
+          'scipy>=1.4.1',
+          'pandas>=1.0.5',
+          'numpy>=1.18.1'
+      ],
       tests_require=['pytest', 'mock'],
       cmdclass={'test': PyTest},
       )
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas/__init__.py new/sklearn-pandas-2.0.2/sklearn_pandas/__init__.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/__init__.py 2018-12-01 20:13:33.000000000 +0100
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/__init__.py 2020-10-01 22:35:05.000000000 +0200
@@ -1,6 +1,5 @@
-__version__ = '1.8.0'
+__version__ = '2.0.2'
 from .dataframe_mapper import DataFrameMapper # NOQA
-from .cross_validation import cross_val_score, GridSearchCV, RandomizedSearchCV # NOQA
-from .transformers import CategoricalImputer, FunctionTransformer # NOQA
 from .features_generator import gen_features # NOQA
+from .transformers import NumericalTransformer # NOQA
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas/categorical_imputer.py new/sklearn-pandas-2.0.2/sklearn_pandas/categorical_imputer.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/categorical_imputer.py 2018-10-21 12:55:27.000000000 +0200
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/categorical_imputer.py 1970-01-01 01:00:00.000000000 +0100
@@ -1,134 +0,0 @@
-import pandas as pd
-import numpy as np
-
-
-from sklearn.base import BaseEstimator, TransformerMixin
-from sklearn.utils.validation import check_is_fitted
-
-
-def _get_mask(X, value):
-    """
-    Compute the boolean mask X == missing_values.
-    """
-    if value == "NaN" or \
-       value is None or \
-       (isinstance(value, float) and np.isnan(value)):
-        return pd.isnull(X)
-    else:
-        return X == value
-
-
-class CategoricalImputer(BaseEstimator, TransformerMixin):
-    """
-    Impute missing values from a categorical/string np.ndarray or pd.Series
-    with the most frequent value on the training data.
-
-    Parameters
-    ----------
-    missing_values : string or "NaN", optional (default="NaN")
-        The placeholder for the missing values. All occurrences of
-        `missing_values` will be imputed. None and np.nan are treated
-        as being the same, use the string value "NaN" for them.
-
-    copy : boolean, optional (default=True)
-        If True, a copy of X will be created.
-
-    strategy : string, optional (default = 'most_frequent')
-        The imputation strategy.
-
-        - If "most_frequent", then replace missing using the most frequent
-          value along each column. Can be used with strings or numeric data.
-        - If "constant", then replace missing values with fill_value. Can be
-          used with strings or numeric data.
-
-    fill_value : string, optional (default='?')
-        The value that all instances of `missing_values` are replaced
-        with if `strategy` is set to `constant`. This is useful if
-        you don't want to impute with the mode, or if there are multiple
-        modes in your data and you want to choose a particular one. If
-        `strategy` is not set to `constant`, this parameter is ignored.
-
-    Attributes
-    ----------
-    fill_ : str
-        The imputation fill value
-
-    """
-
-    def __init__(
-        self,
-        missing_values='NaN',
-        strategy='most_frequent',
-        fill_value='?',
-        copy=True
-    ):
-        self.missing_values = missing_values
-        self.copy = copy
-        self.fill_value = fill_value
-        self.strategy = strategy
-
-        strategies = ['constant', 'most_frequent']
-        if self.strategy not in strategies:
-            raise ValueError(
-                'Strategy {0} not in {1}'.format(self.strategy, strategies)
-            )
-
-    def fit(self, X, y=None):
-        """
-
-        Get the most frequent value.
-
-        Parameters
-        ----------
-            X : np.ndarray or pd.Series
-                Training data.
-
-            y : Passthrough for ``Pipeline`` compatibility.
-
-        Returns
-        -------
-            self: CategoricalImputer
-        """
-
-        mask = _get_mask(X, self.missing_values)
-        X = X[~mask]
-        if self.strategy == 'most_frequent':
-            modes = pd.Series(X).mode()
-        elif self.strategy == 'constant':
-            modes = np.array([self.fill_value])
-        if modes.shape[0] == 0:
-            raise ValueError('Data is empty or all values are null')
-        elif modes.shape[0] > 1:
-            raise ValueError('No value is repeated more than '
-                             'once in the column')
-        else:
-            self.fill_ = modes[0]
-
-        return self
-
-    def transform(self, X):
-        """
-
-        Replaces missing values in the input data with the most frequent value
-        of the training data.
-
-        Parameters
-        ----------
-            X : np.ndarray or pd.Series
-                Data with values to be imputed.
-
-        Returns
-        -------
-            np.ndarray
-                Data with imputed values.
-        """
-
-        check_is_fitted(self, 'fill_')
-
-        if self.copy:
-            X = X.copy()
-
-        mask = _get_mask(X, self.missing_values)
-        X[mask] = self.fill_
-
-        return np.asarray(X)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas/cross_validation.py new/sklearn-pandas-2.0.2/sklearn_pandas/cross_validation.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/cross_validation.py 2017-04-17 12:14:52.000000000 +0200
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/cross_validation.py 2020-09-07 03:30:35.000000000 +0200
@@ -1,59 +1,3 @@
-import warnings
-try:
-    from sklearn.model_selection import cross_val_score as sk_cross_val_score
-    from sklearn.model_selection import GridSearchCV as SKGridSearchCV
-    from sklearn.model_selection import RandomizedSearchCV as \
-        SKRandomizedSearchCV
-except ImportError:
-    from sklearn.cross_validation import cross_val_score as sk_cross_val_score
-    from sklearn.grid_search import GridSearchCV as SKGridSearchCV
-    from sklearn.grid_search import RandomizedSearchCV as SKRandomizedSearchCV
-
-DEPRECATION_MSG = '''
-    Custom cross-validation compatibility shims are no longer needed for
-    scikit-learn>=0.16.0 and will be dropped in sklearn-pandas==2.0.
-'''
-
-
-def cross_val_score(model, X, *args, **kwargs):
-    warnings.warn(DEPRECATION_MSG, DeprecationWarning)
-    X = DataWrapper(X)
-    return sk_cross_val_score(model, X, *args, **kwargs)
-
-
-class GridSearchCV(SKGridSearchCV):
-
-    def __init__(self, *args, **kwargs):
-        warnings.warn(DEPRECATION_MSG, DeprecationWarning)
-        super(GridSearchCV, self).__init__(*args, **kwargs)
-
-    def fit(self, X, *params, **kwparams):
-        return super(GridSearchCV, self).fit(
-            DataWrapper(X), *params, **kwparams)
-
-    def predict(self, X, *params, **kwparams):
-        return super(GridSearchCV, self).predict(
-            DataWrapper(X), *params, **kwparams)
-
-
-try:
-    class RandomizedSearchCV(SKRandomizedSearchCV):
-
-        def __init__(self, *args, **kwargs):
-            warnings.warn(DEPRECATION_MSG, DeprecationWarning)
-            super(RandomizedSearchCV, self).__init__(*args, **kwargs)
-
-        def fit(self, X, *params, **kwparams):
-            return super(RandomizedSearchCV, self).fit(
-                DataWrapper(X), *params, **kwparams)
-
-        def predict(self, X, *params, **kwparams):
-            return super(RandomizedSearchCV, self).predict(
-                DataWrapper(X), *params, **kwparams)
-except AttributeError:
-    pass
-
-
 class DataWrapper(object):
     def __init__(self, df):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas/dataframe_mapper.py new/sklearn-pandas-2.0.2/sklearn_pandas/dataframe_mapper.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/dataframe_mapper.py 2018-08-15 14:42:44.000000000 +0200
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/dataframe_mapper.py 2020-10-01 22:35:05.000000000 +0200
@@ -1,4 +1,3 @@
-import sys
 import contextlib
 import pandas as pd
@@ -9,12 +8,7 @@
 from .cross_validation import DataWrapper
 from .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline
-PY3 = sys.version_info[0] == 3
-if PY3:
-    string_types = text_type = str
-else:
-    string_types = basestring # noqa
-    text_type = unicode # noqa
+string_types = text_type = str
 def _handle_feature(fea):
@@ -69,7 +63,7 @@
     """
     def __init__(self, features, default=False, sparse=False, df_out=False,
-                 input_df=False):
+                 input_df=False, drop_cols=None):
         """
         Params:
@@ -77,7 +71,7 @@
                     The first element is the pandas column selector. This can
                     be a string (for one column) or a list of strings.
                     The second element is an object that supports
-                    sklearn's transform interface, or a list of such objects.
+                    sklearn's transform interface, or a list of such objects
                     The third element is optional and, if present, must be
                     a dictionary with the options to apply to the
                     transformation. Example: {'alias': 'day_of_week'}
@@ -101,14 +95,17 @@
         input_df    If ``True`` pass the selected columns to the transformers
                     as a pandas DataFrame or Series. Otherwise pass them as a
                     numpy array. Defaults to ``False``.
+
+        drop_cols   List of columns to be dropped. Defaults to None.
+
         """
         self.features = features
-        self.built_features = None
         self.default = default
         self.built_default = None
         self.sparse = sparse
         self.df_out = df_out
         self.input_df = input_df
+        self.drop_cols = [] if drop_cols is None else drop_cols
         self.transformed_names_ = []
         if (df_out and (sparse or default)):
@@ -149,7 +146,8 @@
         """
         X_columns = list(X.columns)
         return [column for column in X_columns if
-                column not in self._selected_columns]
+                column not in self._selected_columns
+                and column not in self.drop_cols]
     def __setstate__(self, state):
         # compatibility for older versions of sklearn-pandas
@@ -158,6 +156,7 @@
         self.default = state.get('default', False)
         self.df_out = state.get('df_out', False)
         self.input_df = state.get('input_df', False)
+        self.drop_cols = state.get('drop_cols', [])
         self.built_features = state.get('built_features', self.features)
         self.built_default = state.get('built_default', self.default)
         self.transformed_names_ = state.get('transformed_names_', [])
@@ -209,7 +208,6 @@
         """
         self._build()
-
         for columns, transformers, options in self.built_features:
             input_df = options.get('input_df', self.input_df)
@@ -226,7 +224,8 @@
                 _call_fit(self.built_default.fit, Xt, y)
         return self
-    def get_names(self, columns, transformer, x, alias=None):
+    def get_names(self, columns, transformer, x, alias=None, prefix='',
+                  suffix=''):
         """
         Return verbose names for the transformed columns.
@@ -242,6 +241,9 @@
         else:
             name = columns
         num_cols = x.shape[1] if len(x.shape) > 1 else 1
+
+        output = []
+
         if num_cols > 1:
             # If there are as many columns as classes in the transformer,
             # infer column names from classes names.
@@ -257,13 +259,19 @@
             # Otherwise use the only estimator present
             else:
                 names = _get_feature_names(transformer)
+
             if names is not None and len(names) == num_cols:
-                return ['%s_%s' % (name, o) for o in names]
-            # otherwise, return name concatenated with '_1', '_2', etc.
+                output = [f"{name}_{o}" for o in names]
+                # otherwise, return name concatenated with '_1', '_2', etc.
             else:
-                return [name + '_' + str(o) for o in range(num_cols)]
+                output = [name + '_' + str(o) for o in range(num_cols)]
         else:
-            return [name]
+            output = [name]
+
+        if prefix == suffix == "":
+            return output
+
+        return ['{}{}{}'.format(prefix, x, suffix) for x in output]
     def get_dtypes(self, extracted):
         dtypes_features = [self.get_dtype(ex) for ex in extracted]
@@ -307,8 +315,11 @@
             extracted.append(_handle_feature(Xt))
             alias = options.get('alias')
+            prefix = options.get('prefix', '')
+            suffix = options.get('suffix', '')
+
             self.transformed_names_ += self.get_names(
-                columns, transformers, Xt, alias)
+                columns, transformers, Xt, alias, prefix, suffix)
         # handle features not explicitly selected
         if self.built_default is not False:
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas/features_generator.py new/sklearn-pandas-2.0.2/sklearn_pandas/features_generator.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/features_generator.py 2017-10-22 19:58:20.000000000 +0200
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/features_generator.py 2020-09-07 03:30:35.000000000 +0200
@@ -1,4 +1,4 @@
-def gen_features(columns, classes=None):
+def gen_features(columns, classes=None, prefix='', suffix=''):
     """Generates a feature definition list which can be passed
     into DataFrameMapper
@@ -25,6 +25,10 @@
                 If None value selected, then each feature left as is.
+    prefix      add prefix to transformed column names
+
+    suffix      add suffix to transformed column names.
+
     """
     if classes is None:
         return [(column, None) for column in columns]
@@ -34,9 +38,15 @@
     for column in columns:
         feature_transformers = []
+        arguments = {}
+        if prefix and prefix != "":
+            arguments['prefix'] = prefix
+        if suffix and suffix != "":
+            arguments['suffix'] = suffix
+
         classes = [cls for cls in classes if cls is not None]
         if not classes:
-            feature_defs.append((column, None))
+            feature_defs.append((column, None, arguments))
         else:
             for definition in classes:
@@ -50,6 +60,6 @@
             if not feature_transformers:
                 feature_transformers = None
-            feature_defs.append((column, feature_transformers))
+            feature_defs.append((column, feature_transformers, arguments))
     return feature_defs
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas/transformers.py new/sklearn-pandas-2.0.2/sklearn_pandas/transformers.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/transformers.py 2018-12-01 20:13:29.000000000 +0100
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/transformers.py 2020-09-07 03:30:35.000000000 +0200
@@ -1,8 +1,6 @@
 import numpy as np
 import pandas as pd
-
-from sklearn.base import BaseEstimator, TransformerMixin
-from sklearn.utils.validation import check_is_fitted
+from sklearn.base import TransformerMixin
 def _get_mask(X, value):
@@ -17,136 +15,33 @@
         return X == value
-class CategoricalImputer(BaseEstimator, TransformerMixin):
+class NumericalTransformer(TransformerMixin):
     """
-    Impute missing values from a categorical/string np.ndarray or pd.Series
-    with the most frequent value on the training data.
-
-    Parameters
-    ----------
-    missing_values : string or "NaN", optional (default="NaN")
-        The placeholder for the missing values. All occurrences of
-        `missing_values` will be imputed. None and np.nan are treated
-        as being the same, use the string value "NaN" for them.
-
-    copy : boolean, optional (default=True)
-        If True, a copy of X will be created.
-
-    strategy : string, optional (default = 'most_frequent')
-        The imputation strategy.
-
-        - If "most_frequent", then replace missing using the most frequent
-          value along each column. Can be used with strings or numeric data.
-        - If "constant", then replace missing values with fill_value. Can be
-          used with strings or numeric data.
-
-    fill_value : string, optional (default='?')
-        The value that all instances of `missing_values` are replaced
-        with if `strategy` is set to `constant`. This is useful if
-        you don't want to impute with the mode, or if there are multiple
-        modes in your data and you want to choose a particular one. If
-        `strategy` is not set to `constant`, this parameter is ignored.
-
-    Attributes
-    ----------
-    fill_ : str
-        The imputation fill value
-
+    Provides commonly used numerical transformers.
     """
+    SUPPORTED_FUNCTIONS = ['log', 'log1p']
-    def __init__(
-        self,
-        missing_values='NaN',
-        strategy='most_frequent',
-        fill_value='?',
-        copy=True
-    ):
-        self.missing_values = missing_values
-        self.copy = copy
-        self.fill_value = fill_value
-        self.strategy = strategy
-
-        strategies = ['constant', 'most_frequent']
-        if self.strategy not in strategies:
-            raise ValueError(
-                'Strategy {0} not in {1}'.format(self.strategy, strategies)
-            )
-
-    def fit(self, X, y=None):
-        """
-
-        Get the most frequent value.
-
-        Parameters
-        ----------
-            X : np.ndarray or pd.Series
-                Training data.
-
-            y : Passthrough for ``Pipeline`` compatibility.
-
-        Returns
-        -------
-            self: CategoricalImputer
-        """
-
-        mask = _get_mask(X, self.missing_values)
-        X = X[~mask]
-        if self.strategy == 'most_frequent':
-            modes = pd.Series(X).mode()
-        elif self.strategy == 'constant':
-            modes = np.array([self.fill_value])
-        if modes.shape[0] == 0:
-            raise ValueError('Data is empty or all values are null')
-        elif modes.shape[0] > 1:
-            raise ValueError('No value is repeated more than '
-                             'once in the column')
-        else:
-            self.fill_ = modes[0]
-
-        return self
-
-    def transform(self, X):
+    def __init__(self, func):
         """
+        Params
-        Replaces missing values in the input data with the most frequent value
-        of the training data.
-
-        Parameters
-        ----------
-            X : np.ndarray or pd.Series
-                Data with values to be imputed.
-
-        Returns
-        -------
-            np.ndarray
-                Data with imputed values.
+        func        function to apply to input columns. The function will be
+                    applied to each value. Supported functions are defined
+                    in SUPPORTED_FUNCTIONS variable. Throws assertion error if the
+                    not supported.
         """
-
-        check_is_fitted(self, 'fill_')
-
-        if self.copy:
-            X = X.copy()
-
-        mask = _get_mask(X, self.missing_values)
-        X[mask] = self.fill_
-
-        return np.asarray(X)
-
-
-class FunctionTransformer(BaseEstimator, TransformerMixin):
-    """
-    Use this class to convert a random function into a
-    transformer.
- """ - - def __init__(self, func): + assert func in self.SUPPORTED_FUNCTIONS, \ + f"Only following func are supported: {self.SUPPORTED_FUNCTIONS}" + super(NumericalTransformer, self).__init__() self.__func = func - def fit(self, x, y=None): + def fit(self, X, y=None): return self - def transform(self, x): - return np.vectorize(self.__func)(x) + def transform(self, X, y=None): + if self.__func == 'log1p': + return np.vectorize(np.log1p)(X) + elif self.__func == 'log': + return np.vectorize(np.log)(X) - def __call__(self, *args, **kwargs): - return self.__func(*args, **kwargs) + raise ValueError(f"Invalid function name: {self.__func}") diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/PKG-INFO new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/PKG-INFO --- old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/PKG-INFO 2018-12-01 20:14:57.000000000 +0100 +++ new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/PKG-INFO 2020-10-01 22:54:52.000000000 +0200 @@ -1,12 +1,11 @@ -Metadata-Version: 1.0 +Metadata-Version: 1.2 Name: sklearn-pandas -Version: 1.8.0 +Version: 2.0.2 Summary: Pandas integration with sklearn -Home-page: https://github.com/paulgb/sklearn-pandas -Author: Israel Saeta Pérez -Author-email: israel.sa...@dukebody.com +Home-page: https://github.com/scikit-learn-contrib/sklearn-pandas +Maintainer: Ritesh Agrawal +Maintainer-email: ragra...@gmail.com License: UNKNOWN -Description-Content-Type: UNKNOWN Description: UNKNOWN Keywords: scikit,sklearn,pandas Platform: UNKNOWN diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/SOURCES.txt new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/SOURCES.txt --- old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/SOURCES.txt 2018-12-01 20:14:57.000000000 +0100 +++ new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/SOURCES.txt 2020-10-01 22:54:52.000000000 +0200 @@ 
-4,7 +4,6 @@
 setup.cfg
 setup.py
 sklearn_pandas/__init__.py
-sklearn_pandas/categorical_imputer.py
 sklearn_pandas/cross_validation.py
 sklearn_pandas/dataframe_mapper.py
 sklearn_pandas/features_generator.py
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/requires.txt new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/requires.txt
--- old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/requires.txt 2018-12-01 20:14:57.000000000 +0100
+++ new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/requires.txt 2020-10-01 22:54:52.000000000 +0200
@@ -1,4 +1,4 @@
-scikit-learn>=0.15.0
-scipy>=0.14
-pandas>=0.11.0
-numpy>=1.6.1
+scikit-learn>=0.23.0
+scipy>=1.4.1
+pandas>=1.0.5
+numpy>=1.18.1
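Note on the prefix/suffix change in the dataframe_mapper.py and features_generator.py hunks above: the following is a minimal standalone sketch of the naming behaviour those hunks appear to introduce. It is not the package's code. The helper names apply_prefix_suffix and build_options are hypothetical and exist only for illustration; they imitate the tail of the patched get_names() and the options dict assembled in gen_features().

```python
def apply_prefix_suffix(names, prefix="", suffix=""):
    """Hypothetical helper imitating the end of the patched get_names().

    When both prefix and suffix are empty the names are returned unchanged;
    otherwise every name is wrapped as prefix + name + suffix.
    """
    if prefix == suffix == "":
        return list(names)
    return ["{}{}{}".format(prefix, name, suffix) for name in names]


def build_options(prefix="", suffix=""):
    """Hypothetical helper imitating the options dict built in gen_features().

    Only non-empty values are stored, so a feature definition without
    prefix/suffix carries an empty dict.
    """
    options = {}
    if prefix:
        options["prefix"] = prefix
    if suffix:
        options["suffix"] = suffix
    return options


if __name__ == "__main__":
    # Names as get_names() would produce them for a two-column output.
    base_names = ["age_0", "age_1"]
    print(apply_prefix_suffix(base_names, prefix="scaled_"))
    print(apply_prefix_suffix(base_names, suffix="_log"))
    print(build_options(suffix="_log"))
```

In the patched mapper the two values are read back with options.get('prefix', '') and options.get('suffix', ''), so a feature definition that omits them keeps its previous column names.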