Hello community,
here is the log from the commit of package python-sklearn-pandas for
openSUSE:Factory checked in at 2020-10-25 18:06:34
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-sklearn-pandas (Old)
and /work/SRC/openSUSE:Factory/.python-sklearn-pandas.new.3463 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "python-sklearn-pandas"
Sun Oct 25 18:06:34 2020 rev:5 rq:841146 version:2.0.2
Changes:
--------
---
/work/SRC/openSUSE:Factory/python-sklearn-pandas/python-sklearn-pandas.changes
2020-01-07 23:52:07.083992834 +0100
+++
/work/SRC/openSUSE:Factory/.python-sklearn-pandas.new.3463/python-sklearn-pandas.changes
2020-10-25 18:06:44.367352067 +0100
@@ -1,0 +2,30 @@
+Sat Oct 10 19:08:06 UTC 2020 - Arun Persaud <[email protected]>
+
+- specfile:
+ * updated versions of required packages
+
+- update to version 2.0.2:
+ * Fix DataFrameMapper drop_cols attribute naming consistency with
+ scikit-learn and initialization.
+
+- changes from version 2.0.1:
+ * Added an option to explicitly drop columns.
+
+- changes from version 2.0.0:
+ * Deprecated support for Python < 3.6.
+ * Deprecated support for old versions of scikit-learn, pandas and
+ numpy. Please check setup.py for the minimum requirements.
+ * Removed CategoricalImputer, cross_val_score and GridSearchCV. All
+ of this functionality now exists as part of scikit-learn. Please
+ use SimpleImputer instead of CategoricalImputer. Cross-validation
+ in sklearn now supports DataFrames, so the cross-validation wrapper
+ provided here is no longer needed.
+ * Added NumericalTransformer for common numerical
+ transformations. Currently it implements log and log1p
+ transformation.
+ * Added prefix and suffix options. See examples above. These are
+ usually helpful when using gen_features.
+ * Added drop_cols argument to DataFrameMapper. This can be used to
+ explicitly drop columns.
+
+-------------------------------------------------------------------
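The NumericalTransformer entry in the changelog above can be illustrated with a minimal sketch, assuming the transformer keeps only the function name as a string and resolves it at transform time. The class name NamedFunctionTransformer and its internals are hypothetical and are not the sklearn-pandas implementation; only the supported names log and log1p come from the changelog, and the sample values are made up for illustration.

```python
import math


class NamedFunctionTransformer:
    """Hypothetical stand-in, not the sklearn-pandas class."""

    # Function names mentioned in the changelog, mapped to stdlib callables.
    SUPPORTED = {"log": math.log, "log1p": math.log1p}

    def __init__(self, func):
        if func not in self.SUPPORTED:
            raise ValueError("unsupported function: {!r}".format(func))
        # Only the name is stored, so the instance state is a plain string.
        self.func = func

    def fit(self, X, y=None):
        # Nothing to learn; present only to follow the fit/transform pattern.
        return self

    def transform(self, X):
        f = self.SUPPORTED[self.func]
        return [f(x) for x in X]


# Illustrative input values (an assumption, not taken from the package).
children = [4, 6, 3, 3, 2, 3, 5, 4]
logged = NamedFunctionTransformer("log").fit(children).transform(children)
print([round(v, 4) for v in logged])
```

Because the state is just a string, such an object avoids carrying a reference to an arbitrary callable, which is the serialization concern the README raises.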
Old:
----
sklearn-pandas-1.8.0.tar.gz
New:
----
sklearn-pandas-2.0.2.tar.gz
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ python-sklearn-pandas.spec ++++++
--- /var/tmp/diff_new_pack.Xk17P1/_old 2020-10-25 18:06:45.675353305 +0100
+++ /var/tmp/diff_new_pack.Xk17P1/_new 2020-10-25 18:06:45.675353305 +0100
@@ -1,7 +1,7 @@
#
# spec file for package python-sklearn-pandas
#
-# Copyright (c) 2020 SUSE LINUX GmbH, Nuernberg, Germany.
+# Copyright (c) 2020 SUSE LLC
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
@@ -19,7 +19,7 @@
%{?!python_module:%define python_module() python-%{**} python3-%{**}}
%define skip_python2 1
Name: python-sklearn-pandas
-Version: 1.8.0
+Version: 2.0.2
Release: 0
Summary: Pandas integration with sklearn
License: Zlib AND BSD-2-Clause
@@ -29,18 +29,18 @@
BuildRequires: %{python_module setuptools}
BuildRequires: fdupes
BuildRequires: python-rpm-macros
-Requires: python-numpy >= 1.6.1
-Requires: python-pandas >= 0.11.0
-Requires: python-scikit-learn >= 0.15.0
-Requires: python-scipy >= 0.14
+Requires: python-numpy >= 1.18.1
+Requires: python-pandas >= 1.0.5
+Requires: python-scikit-learn >= 0.23.0
+Requires: python-scipy >= 1.4.1
BuildArch: noarch
# SECTION test requirements
BuildRequires: %{python_module mock}
-BuildRequires: %{python_module numpy >= 1.6.1}
-BuildRequires: %{python_module pandas >= 0.11.0}
+BuildRequires: %{python_module numpy >= 1.18.1}
+BuildRequires: %{python_module pandas >= 1.0.5}
BuildRequires: %{python_module pytest}
-BuildRequires: %{python_module scikit-learn >= 0.15.0}
-BuildRequires: %{python_module scipy >= 0.14}
+BuildRequires: %{python_module scikit-learn >= 0.23.0}
+BuildRequires: %{python_module scipy >= 1.4.1}
# /SECTION
%python_subpackages
++++++ sklearn-pandas-1.8.0.tar.gz -> sklearn-pandas-2.0.2.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/sklearn-pandas-1.8.0/PKG-INFO
new/sklearn-pandas-2.0.2/PKG-INFO
--- old/sklearn-pandas-1.8.0/PKG-INFO 2018-12-01 20:14:57.000000000 +0100
+++ new/sklearn-pandas-2.0.2/PKG-INFO 2020-10-01 22:54:52.000000000 +0200
@@ -1,12 +1,11 @@
-Metadata-Version: 1.0
+Metadata-Version: 1.2
Name: sklearn-pandas
-Version: 1.8.0
+Version: 2.0.2
Summary: Pandas integration with sklearn
-Home-page: https://github.com/paulgb/sklearn-pandas
-Author: Israel Saeta Pérez
-Author-email: [email protected]
+Home-page: https://github.com/scikit-learn-contrib/sklearn-pandas
+Maintainer: Ritesh Agrawal
+Maintainer-email: [email protected]
License: UNKNOWN
-Description-Content-Type: UNKNOWN
Description: UNKNOWN
Keywords: scikit,sklearn,pandas
Platform: UNKNOWN
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/sklearn-pandas-1.8.0/README.rst
new/sklearn-pandas-2.0.2/README.rst
--- old/sklearn-pandas-1.8.0/README.rst 2018-12-01 20:13:37.000000000 +0100
+++ new/sklearn-pandas-2.0.2/README.rst 2020-10-01 22:35:05.000000000 +0200
@@ -2,16 +2,11 @@
Sklearn-pandas
==============
-.. image:: https://circleci.com/gh/pandas-dev/sklearn-pandas.svg?style=svg
- :target: https://circleci.com/gh/pandas-dev/sklearn-pandas
+.. image::
https://circleci.com/gh/scikit-learn-contrib/sklearn-pandas.svg?style=svg
+ :target: https://circleci.com/gh/scikit-learn-contrib/sklearn-pandas
This module provides a bridge between `Scikit-Learn
<http://scikit-learn.org/stable>`__'s machine learning methods and `pandas
<https://pandas.pydata.org>`__-style Data Frames.
-
-In particular, it provides:
-
-1. A way to map ``DataFrame`` columns to transformations, which are later
recombined into features.
-2. A compatibility shim for old ``scikit-learn`` versions to cross-validate a
pipeline that takes a pandas ``DataFrame`` as input. This is only needed for
``scikit-learn<0.16.0`` (see `#11
<https://github.com/paulgb/sklearn-pandas/issues/11>`__ for details). It is
deprecated and will likely be dropped in ``skearn-pandas==2.0``.
-3. A couple of special transformers that work well with pandas inputs:
``CategoricalImputer`` and ``FunctionTransformer`.`
+In particular, it provides a way to map ``DataFrame`` columns to
transformations, which are later recombined into features.
Installation
------------
@@ -20,6 +15,7 @@
# pip install sklearn-pandas
+
Tests
-----
@@ -36,11 +32,11 @@
Import what you need from the ``sklearn_pandas`` package. The choices are:
* ``DataFrameMapper``, a class for mapping pandas data frame columns to
different sklearn transformations
-* ``cross_val_score``, similar to ``sklearn.cross_validation.cross_val_score``
but working on pandas DataFrames
+
For this demonstration, we will import both::
- >>> from sklearn_pandas import DataFrameMapper, cross_val_score
+ >>> from sklearn_pandas import DataFrameMapper
For these examples, we'll also use pandas, numpy, and sklearn::
@@ -136,6 +132,16 @@
>>> mapper_alias.transformed_names_
['children_scaled']
+Alternatively, you can also specify prefix and/or suffix to add to the column
name. For example::
+
+
+ >>> mapper_alias = DataFrameMapper([
+ ... (['children'], sklearn.preprocessing.StandardScaler(), {'prefix':
'standard_scaled_'}),
+ ... (['children'], sklearn.preprocessing.StandardScaler(), {'suffix':
'_raw'})
+ ... ])
+ >>> _ = mapper_alias.fit_transform(data.copy())
+ >>> mapper_alias.transformed_names_
+ ['standard_scaled_children', 'children_raw']
Passing Series/DataFrames to the transformers
*********************************************
@@ -204,6 +210,32 @@
Note this does not work together with the ``default=True`` or ``sparse=True``
arguments to the mapper.
+Dropping columns explicitly
+***************************
+
+Sometimes a specific column or list of columns needs to be dropped.
+The ``drop_cols`` argument of ``DataFrameMapper`` can be used for this purpose.
+Its default value is ``None``.
+
+ >>> mapper_df = DataFrameMapper([
+ ... ('pet', sklearn.preprocessing.LabelBinarizer()),
+ ... (['children'], sklearn.preprocessing.StandardScaler())
+ ... ], drop_cols=['salary'])
+
+Now running ``fit_transform`` will run transformations on 'pet' and 'children'
and drop the 'salary' column:
+
+ >>> np.round(mapper_df.fit_transform(data.copy()), 1)
+ array([[ 1. , 0. , 0. , 0.2],
+ [ 0. , 1. , 0. , 1.9],
+ [ 0. , 1. , 0. , -0.6],
+ [ 0. , 0. , 1. , -0.6],
+ [ 1. , 0. , 0. , -1.5],
+ [ 0. , 1. , 0. , -0.6],
+ [ 1. , 0. , 0. , 1. ],
+ [ 0. , 0. , 1. , 0.2]])
+
+Transformations may require multiple input columns. In these
+
Transform Multiple Columns
**************************
@@ -231,8 +263,9 @@
Multiple transformers can be applied to the same column specifying them
in a list::
+ >>> from sklearn.impute import SimpleImputer
>>> mapper3 = DataFrameMapper([
- ... (['age'], [sklearn.preprocessing.Imputer(),
+ ... (['age'], [SimpleImputer(),
... sklearn.preprocessing.StandardScaler()])])
>>> data_3 = pd.DataFrame({'age': [1, np.nan, 3]})
>>> mapper3.fit_transform(data_3)
@@ -302,7 +335,7 @@
... classes=[sklearn.preprocessing.LabelEncoder]
... )
>>> feature_def
- [('col1', [LabelEncoder()]), ('col2', [LabelEncoder()]), ('col3',
[LabelEncoder()])]
+ [('col1', [LabelEncoder()], {}), ('col2', [LabelEncoder()], {}), ('col3',
[LabelEncoder()], {})]
>>> mapper5 = DataFrameMapper(feature_def)
>>> data5 = pd.DataFrame({
... 'col1': ['yes', 'no', 'yes'],
@@ -318,23 +351,42 @@
transformer parameters should be provided. For example, consider a dataset
with missing values.
Then the following code could be used to override default imputing strategy:
+ >>> from sklearn.impute import SimpleImputer
+ >>> import numpy as np
>>> feature_def = gen_features(
... columns=[['col1'], ['col2'], ['col3']],
- ... classes=[{'class': sklearn.preprocessing.Imputer, 'strategy':
'most_frequent'}]
+ ... classes=[{'class': SimpleImputer, 'strategy':'most_frequent'}]
... )
>>> mapper6 = DataFrameMapper(feature_def)
>>> data6 = pd.DataFrame({
- ... 'col1': [None, 1, 1, 2, 3],
- ... 'col2': [True, False, None, None, True],
- ... 'col3': [0, 0, 0, None, None]
+ ... 'col1': [np.nan, 1, 1, 2, 3],
+ ... 'col2': [True, False, np.nan, np.nan, True],
+ ... 'col3': [0, 0, 0, np.nan, np.nan]
... })
>>> mapper6.fit_transform(data6)
- array([[1., 1., 0.],
- [1., 0., 0.],
- [1., 1., 0.],
- [2., 1., 0.],
- [3., 1., 0.]])
+ array([[1.0, True, 0.0],
+ [1.0, False, 0.0],
+ [1.0, True, 0.0],
+ [2.0, True, 0.0],
+ [3.0, True, 0.0]], dtype=object)
+You can also specify global prefix or suffix for the generated transformed
column names using the prefix and suffix
+parameters::
+
+ >>> feature_def = gen_features(
+ ... columns=['col1', 'col2', 'col3'],
+ ... classes=[sklearn.preprocessing.LabelEncoder],
+ ... prefix="lblencoder_"
+ ... )
+ >>> mapper5 = DataFrameMapper(feature_def)
+ >>> data5 = pd.DataFrame({
+ ... 'col1': ['yes', 'no', 'yes'],
+ ... 'col2': [True, False, False],
+ ... 'col3': ['one', 'two', 'three']
+ ... })
+ >>> _ = mapper5.fit_transform(data5)
+ >>> mapper5.transformed_names_
+ ['lblencoder_col1', 'lblencoder_col2', 'lblencoder_col3']
Feature selection and other supervised transformations
******************************************************
@@ -356,7 +408,8 @@
Working with sparse features
****************************
-A ``DataFrameMapper`` will return a dense feature array by default. Setting
``sparse=True`` in the mapper will return a sparse array whenever any of the
extracted features is sparse. Example:
+A ``DataFrameMapper`` will return a dense feature array by default. Setting
``sparse=True`` in the mapper will return
+a sparse array whenever any of the extracted features is sparse. Example:
>>> mapper5 = DataFrameMapper([
... ('pet', CountVectorizer()),
@@ -366,87 +419,89 @@
The stacking of the sparse features is done without ever densifying them.
-Cross-Validation
-****************
-Now that we can combine features from pandas DataFrames, we may want to use
cross-validation to see whether our model works. ``scikit-learn<0.16.0``
provided features for cross-validation, but they expect numpy data structures
and won't work with ``DataFrameMapper``.
+Using ``NumericalTransformer``
+***********************************
-To get around this, sklearn-pandas provides a wrapper on sklearn's
``cross_val_score`` function which passes a pandas DataFrame to the estimator
rather than a numpy array::
+While you can use ``FunctionTransformer`` to generate arbitrary
transformers, it can present serialization issues
+when pickling. Use ``NumericalTransformer`` instead, which takes the function
name as a string parameter and hence
+can be easily serialized.
- >>> pipe = sklearn.pipeline.Pipeline([
- ... ('featurize', mapper),
- ... ('lm', sklearn.linear_model.LinearRegression())])
- >>> np.round(cross_val_score(pipe, X=data.copy(), y=data.salary,
scoring='r2'), 2)
- array([ -1.09, -5.3 , -15.38])
-
-Sklearn-pandas' ``cross_val_score`` function provides exactly the same
interface as sklearn's function of the same name.
-
-``CategoricalImputer``
-**********************
-
-Since the ``scikit-learn`` ``Imputer`` transformer currently only works with
-numbers, ``sklearn-pandas`` provides an equivalent helper transformer that
-works with strings, substituting null values with the most frequent value in
-that column. Alternatively, you can specify a fixed value to use.
+ >>> from sklearn_pandas import NumericalTransformer
+ >>> mapper5 = DataFrameMapper([
+ ... ('children', NumericalTransformer('log')),
+ ... ])
+ >>> mapper5.fit_transform(data)
+ array([[1.38629436],
+ [1.79175947],
+ [1.09861229],
+ [1.09861229],
+ [0.69314718],
+ [1.09861229],
+ [1.60943791],
+ [1.38629436]])
-Example: imputing with the mode:
- >>> from sklearn_pandas import CategoricalImputer
- >>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
- >>> imputer = CategoricalImputer()
- >>> imputer.fit_transform(data)
- array(['a', 'b', 'b', 'b'], dtype=object)
-Example: imputing with a fixed value:
+Changelog
+---------
+2.0.2 (2020-10-01)
+******************
- >>> from sklearn_pandas import CategoricalImputer
- >>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
- >>> imputer = CategoricalImputer(strategy='constant', fill_value='a')
- >>> imputer.fit_transform(data)
- array(['a', 'b', 'b', 'a'], dtype=object)
+* Fix `DataFrameMapper` drop_cols attribute naming consistency with
scikit-learn and initialization.
-``FunctionTransformer``
-***********************
+2.0.1 (2020-09-07)
+******************
-Often one wants to apply simple transformations to data such as ``np.log``.
``FunctionTransformer`` is a simple wrapper that takes any function and applies
vectorization so that it can be used as a transformer.
+* Added an option to explicitly drop columns.
-Example:
- >>> from sklearn_pandas import FunctionTransformer
- >>> array = np.array([10, 100])
- >>> transformer = FunctionTransformer(np.log10)
+2.0.0 (2020-08-01)
+******************
- >>> transformer.fit_transform(array)
- array([1., 2.])
+* Deprecated support for Python < 3.6.
+* Deprecated support for old versions of scikit-learn, pandas and numpy.
Please check setup.py for the minimum requirements.
+* Removed CategoricalImputer, cross_val_score and GridSearchCV. All of this
functionality now exists as part of
+ scikit-learn. Please use SimpleImputer instead of CategoricalImputer.
+ Cross-validation in sklearn now supports DataFrames, so the
cross-validation wrapper provided
+ here is no longer needed.
+* Added ``NumericalTransformer`` for common numerical transformations.
Currently it implements log and log1p
+ transformation.
+* Added prefix and suffix options. See examples above. These are usually
helpful when using gen_features.
+* Added ``drop_cols`` argument to DataFrameMapper. This can be used to
explicitly drop columns.
-Changelog
----------
1.8.0 (2018-12-01)
******************
+
* Add ``FunctionTransformer`` class (#117).
* Fix column names derivation for dataframes with multi-index or non-string
columns (#166).
* Change behaviour of DataFrameMapper's fit_transform method to invoke each
underlying transformers'
native fit_transform if implemented. (#150)
+
1.7.0 (2018-08-15)
******************
+
* Fix issues with unicode names in ``get_names`` (#160).
* Update to build using ``numpy==1.14`` and ``python==3.6`` (#154).
* Add ``strategy`` and ``fill_value`` parameters to ``CategoricalImputer`` to
allow imputing
with values other than the mode (#144), (#161).
* Preserve input data types when no transform is supplied (#138).
+
1.6.0 (2017-10-28)
******************
+
* Add column name to exception during fit/transform (#110).
* Add ``gen_feature`` helper function to help generating the same
transformation for multiple columns (#126).
1.5.0 (2017-06-24)
******************
+
* Allow inputting a dataframe/series per group of columns.
* Get feature names also from ``estimator.get_feature_names()`` if present.
* Attempt to derive feature names from individual transformers when applying a
@@ -457,6 +512,7 @@
1.4.0 (2017-05-13)
******************
+
* Allow specifying a custom name (alias) for transformed columns (#83).
* Capture output columns generated names in ``transformed_names_`` attribute
(#78).
* Add ``CategoricalImputer`` that replaces null-like values with the mode
@@ -534,3 +590,5 @@
* Timothy Sweetser (@hacktuarial)
* Vitaley Zaretskey (@vzaretsk)
* Zac Stewart (@zacstewart)
+* Parul Singh (@paro1234)
+* Vincent Heusinkveld (@VHeusinkveld)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/sklearn-pandas-1.8.0/setup.py
new/sklearn-pandas-2.0.2/setup.py
--- old/sklearn-pandas-1.8.0/setup.py 2016-04-03 13:14:44.000000000 +0200
+++ new/sklearn-pandas-2.0.2/setup.py 2020-09-07 03:30:35.000000000 +0200
@@ -32,16 +32,17 @@
setup(name='sklearn-pandas',
version=__version__,
description='Pandas integration with sklearn',
- maintainer='Israel Saeta Pérez',
- maintainer_email='[email protected]',
- url='https://github.com/paulgb/sklearn-pandas',
+ maintainer='Ritesh Agrawal',
+ maintainer_email='[email protected]',
+ url='https://github.com/scikit-learn-contrib/sklearn-pandas',
packages=['sklearn_pandas'],
keywords=['scikit', 'sklearn', 'pandas'],
install_requires=[
- 'scikit-learn>=0.15.0',
- 'scipy>=0.14',
- 'pandas>=0.11.0',
- 'numpy>=1.6.1'],
+ 'scikit-learn>=0.23.0',
+ 'scipy>=1.4.1',
+ 'pandas>=1.0.5',
+ 'numpy>=1.18.1'
+ ],
tests_require=['pytest', 'mock'],
cmdclass={'test': PyTest},
)
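The raised minimums in install_requires can be checked against an environment with a small sketch like the one below. It assumes purely numeric dotted version strings; the helper names as_tuple and unmet, and the sample environment, are hypothetical and only serve as illustration. Real version strings can contain suffixes such as rc1, which this sketch does not handle.

```python
# Minimum versions as listed in install_requires for 2.0.2.
MINIMUMS = {
    "scikit-learn": "0.23.0",
    "scipy": "1.4.1",
    "pandas": "1.0.5",
    "numpy": "1.18.1",
}


def as_tuple(version):
    """Turn '1.18.1' into (1, 18, 1); assumes purely numeric components."""
    return tuple(int(part) for part in version.split("."))


def unmet(installed):
    """Return required package names that are missing or below the minimum."""
    return sorted(
        name
        for name, minimum in MINIMUMS.items()
        if name not in installed
        or as_tuple(installed[name]) < as_tuple(minimum)
    )


# Hypothetical environment, for illustration only.
env = {
    "scikit-learn": "0.22.2",
    "scipy": "1.5.0",
    "pandas": "1.0.5",
    "numpy": "1.19.0",
}
print(unmet(env))
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.9.0" would sort after "0.23.0".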
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas/__init__.py
new/sklearn-pandas-2.0.2/sklearn_pandas/__init__.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/__init__.py 2018-12-01
20:13:33.000000000 +0100
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/__init__.py 2020-10-01
22:35:05.000000000 +0200
@@ -1,6 +1,5 @@
-__version__ = '1.8.0'
+__version__ = '2.0.2'
from .dataframe_mapper import DataFrameMapper # NOQA
-from .cross_validation import cross_val_score, GridSearchCV,
RandomizedSearchCV # NOQA
-from .transformers import CategoricalImputer, FunctionTransformer # NOQA
from .features_generator import gen_features # NOQA
+from .transformers import NumericalTransformer # NOQA
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore'
old/sklearn-pandas-1.8.0/sklearn_pandas/categorical_imputer.py
new/sklearn-pandas-2.0.2/sklearn_pandas/categorical_imputer.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/categorical_imputer.py
2018-10-21 12:55:27.000000000 +0200
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/categorical_imputer.py
1970-01-01 01:00:00.000000000 +0100
@@ -1,134 +0,0 @@
-import pandas as pd
-import numpy as np
-
-
-from sklearn.base import BaseEstimator, TransformerMixin
-from sklearn.utils.validation import check_is_fitted
-
-
-def _get_mask(X, value):
- """
- Compute the boolean mask X == missing_values.
- """
- if value == "NaN" or \
- value is None or \
- (isinstance(value, float) and np.isnan(value)):
- return pd.isnull(X)
- else:
- return X == value
-
-
-class CategoricalImputer(BaseEstimator, TransformerMixin):
- """
- Impute missing values from a categorical/string np.ndarray or pd.Series
- with the most frequent value on the training data.
-
- Parameters
- ----------
- missing_values : string or "NaN", optional (default="NaN")
- The placeholder for the missing values. All occurrences of
- `missing_values` will be imputed. None and np.nan are treated
- as being the same, use the string value "NaN" for them.
-
- copy : boolean, optional (default=True)
- If True, a copy of X will be created.
-
- strategy : string, optional (default = 'most_frequent')
- The imputation strategy.
-
- - If "most_frequent", then replace missing using the most frequent
- value along each column. Can be used with strings or numeric data.
- - If "constant", then replace missing values with fill_value. Can be
- used with strings or numeric data.
-
- fill_value : string, optional (default='?')
- The value that all instances of `missing_values` are replaced
- with if `strategy` is set to `constant`. This is useful if
- you don't want to impute with the mode, or if there are multiple
- modes in your data and you want to choose a particular one. If
- `strategy` is not set to `constant`, this parameter is ignored.
-
- Attributes
- ----------
- fill_ : str
- The imputation fill value
-
- """
-
- def __init__(
- self,
- missing_values='NaN',
- strategy='most_frequent',
- fill_value='?',
- copy=True
- ):
- self.missing_values = missing_values
- self.copy = copy
- self.fill_value = fill_value
- self.strategy = strategy
-
- strategies = ['constant', 'most_frequent']
- if self.strategy not in strategies:
- raise ValueError(
- 'Strategy {0} not in {1}'.format(self.strategy, strategies)
- )
-
- def fit(self, X, y=None):
- """
-
- Get the most frequent value.
-
- Parameters
- ----------
- X : np.ndarray or pd.Series
- Training data.
-
- y : Passthrough for ``Pipeline`` compatibility.
-
- Returns
- -------
- self: CategoricalImputer
- """
-
- mask = _get_mask(X, self.missing_values)
- X = X[~mask]
- if self.strategy == 'most_frequent':
- modes = pd.Series(X).mode()
- elif self.strategy == 'constant':
- modes = np.array([self.fill_value])
- if modes.shape[0] == 0:
- raise ValueError('Data is empty or all values are null')
- elif modes.shape[0] > 1:
- raise ValueError('No value is repeated more than '
- 'once in the column')
- else:
- self.fill_ = modes[0]
-
- return self
-
- def transform(self, X):
- """
-
- Replaces missing values in the input data with the most frequent value
- of the training data.
-
- Parameters
- ----------
- X : np.ndarray or pd.Series
- Data with values to be imputed.
-
- Returns
- -------
- np.ndarray
- Data with imputed values.
- """
-
- check_is_fitted(self, 'fill_')
-
- if self.copy:
- X = X.copy()
-
- mask = _get_mask(X, self.missing_values)
- X[mask] = self.fill_
-
- return np.asarray(X)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore'
old/sklearn-pandas-1.8.0/sklearn_pandas/cross_validation.py
new/sklearn-pandas-2.0.2/sklearn_pandas/cross_validation.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/cross_validation.py 2017-04-17
12:14:52.000000000 +0200
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/cross_validation.py 2020-09-07
03:30:35.000000000 +0200
@@ -1,59 +1,3 @@
-import warnings
-try:
- from sklearn.model_selection import cross_val_score as sk_cross_val_score
- from sklearn.model_selection import GridSearchCV as SKGridSearchCV
- from sklearn.model_selection import RandomizedSearchCV as \
- SKRandomizedSearchCV
-except ImportError:
- from sklearn.cross_validation import cross_val_score as sk_cross_val_score
- from sklearn.grid_search import GridSearchCV as SKGridSearchCV
- from sklearn.grid_search import RandomizedSearchCV as SKRandomizedSearchCV
-
-DEPRECATION_MSG = '''
- Custom cross-validation compatibility shims are no longer needed for
- scikit-learn>=0.16.0 and will be dropped in sklearn-pandas==2.0.
-'''
-
-
-def cross_val_score(model, X, *args, **kwargs):
- warnings.warn(DEPRECATION_MSG, DeprecationWarning)
- X = DataWrapper(X)
- return sk_cross_val_score(model, X, *args, **kwargs)
-
-
-class GridSearchCV(SKGridSearchCV):
-
- def __init__(self, *args, **kwargs):
- warnings.warn(DEPRECATION_MSG, DeprecationWarning)
- super(GridSearchCV, self).__init__(*args, **kwargs)
-
- def fit(self, X, *params, **kwparams):
- return super(GridSearchCV, self).fit(
- DataWrapper(X), *params, **kwparams)
-
- def predict(self, X, *params, **kwparams):
- return super(GridSearchCV, self).predict(
- DataWrapper(X), *params, **kwparams)
-
-
-try:
- class RandomizedSearchCV(SKRandomizedSearchCV):
-
- def __init__(self, *args, **kwargs):
- warnings.warn(DEPRECATION_MSG, DeprecationWarning)
- super(RandomizedSearchCV, self).__init__(*args, **kwargs)
-
- def fit(self, X, *params, **kwparams):
- return super(RandomizedSearchCV, self).fit(
- DataWrapper(X), *params, **kwparams)
-
- def predict(self, X, *params, **kwparams):
- return super(RandomizedSearchCV, self).predict(
- DataWrapper(X), *params, **kwparams)
-except AttributeError:
- pass
-
-
class DataWrapper(object):
def __init__(self, df):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore'
old/sklearn-pandas-1.8.0/sklearn_pandas/dataframe_mapper.py
new/sklearn-pandas-2.0.2/sklearn_pandas/dataframe_mapper.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/dataframe_mapper.py 2018-08-15
14:42:44.000000000 +0200
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/dataframe_mapper.py 2020-10-01
22:35:05.000000000 +0200
@@ -1,4 +1,3 @@
-import sys
import contextlib
import pandas as pd
@@ -9,12 +8,7 @@
from .cross_validation import DataWrapper
from .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline
-PY3 = sys.version_info[0] == 3
-if PY3:
- string_types = text_type = str
-else:
- string_types = basestring # noqa
- text_type = unicode # noqa
+string_types = text_type = str
def _handle_feature(fea):
@@ -69,7 +63,7 @@
"""
def __init__(self, features, default=False, sparse=False, df_out=False,
- input_df=False):
+ input_df=False, drop_cols=None):
"""
Params:
@@ -77,7 +71,7 @@
The first element is the pandas column selector. This can
be a string (for one column) or a list of strings.
The second element is an object that supports
- sklearn's transform interface, or a list of such objects.
+ sklearn's transform interface, or a list of such objects
The third element is optional and, if present, must be
a dictionary with the options to apply to the
transformation. Example: {'alias': 'day_of_week'}
@@ -101,14 +95,17 @@
input_df If ``True`` pass the selected columns to the transformers
as a pandas DataFrame or Series. Otherwise pass them as a
numpy array. Defaults to ``False``.
+
+ drop_cols List of columns to be dropped. Defaults to None.
+
"""
self.features = features
- self.built_features = None
self.default = default
self.built_default = None
self.sparse = sparse
self.df_out = df_out
self.input_df = input_df
+ self.drop_cols = [] if drop_cols is None else drop_cols
self.transformed_names_ = []
if (df_out and (sparse or default)):
@@ -149,7 +146,8 @@
"""
X_columns = list(X.columns)
return [column for column in X_columns if
- column not in self._selected_columns]
+ column not in self._selected_columns
+ and column not in self.drop_cols]
def __setstate__(self, state):
# compatibility for older versions of sklearn-pandas
@@ -158,6 +156,7 @@
self.default = state.get('default', False)
self.df_out = state.get('df_out', False)
self.input_df = state.get('input_df', False)
+ self.drop_cols = state.get('drop_cols', [])
self.built_features = state.get('built_features', self.features)
self.built_default = state.get('built_default', self.default)
self.transformed_names_ = state.get('transformed_names_', [])
@@ -209,7 +208,6 @@
"""
self._build()
-
for columns, transformers, options in self.built_features:
input_df = options.get('input_df', self.input_df)
@@ -226,7 +224,8 @@
_call_fit(self.built_default.fit, Xt, y)
return self
- def get_names(self, columns, transformer, x, alias=None):
+ def get_names(self, columns, transformer, x, alias=None, prefix='',
+ suffix=''):
"""
Return verbose names for the transformed columns.
@@ -242,6 +241,9 @@
else:
name = columns
num_cols = x.shape[1] if len(x.shape) > 1 else 1
+
+ output = []
+
if num_cols > 1:
# If there are as many columns as classes in the transformer,
# infer column names from classes names.
@@ -257,13 +259,19 @@
# Otherwise use the only estimator present
else:
names = _get_feature_names(transformer)
+
if names is not None and len(names) == num_cols:
- return ['%s_%s' % (name, o) for o in names]
- # otherwise, return name concatenated with '_1', '_2', etc.
+ output = [f"{name}_{o}" for o in names]
+ # otherwise, return name concatenated with '_1', '_2', etc.
else:
- return [name + '_' + str(o) for o in range(num_cols)]
+ output = [name + '_' + str(o) for o in range(num_cols)]
else:
- return [name]
+ output = [name]
+
+ if prefix == suffix == "":
+ return output
+
+ return ['{}{}{}'.format(prefix, x, suffix) for x in output]
def get_dtypes(self, extracted):
dtypes_features = [self.get_dtype(ex) for ex in extracted]
@@ -307,8 +315,11 @@
extracted.append(_handle_feature(Xt))
alias = options.get('alias')
+ prefix = options.get('prefix', '')
+ suffix = options.get('suffix', '')
+
self.transformed_names_ += self.get_names(
- columns, transformers, Xt, alias)
+ columns, transformers, Xt, alias, prefix, suffix)
# handle features not explicitly selected
if self.built_default is not False:
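The two behavioural changes in the dataframe_mapper.py hunks above (prefix/suffix naming and drop_cols filtering) can be sketched as standalone functions. The function names apply_affixes and unselected_columns are hypothetical; they mirror the logic visible in the diff and are not part of the sklearn-pandas API. The column names are taken from the README example.

```python
def apply_affixes(names, prefix="", suffix=""):
    """Mirror the final step added to get_names: wrap each generated name."""
    if prefix == suffix == "":
        return names
    return ["{}{}{}".format(prefix, name, suffix) for name in names]


def unselected_columns(all_columns, selected, drop_cols=None):
    """Columns left for the default transformer: neither selected nor dropped."""
    drop_cols = [] if drop_cols is None else drop_cols
    return [
        column
        for column in all_columns
        if column not in selected and column not in drop_cols
    ]


print(apply_affixes(["children"], prefix="standard_scaled_"))
print(apply_affixes(["children"], suffix="_raw"))
print(unselected_columns(["pet", "children", "salary"],
                         ["pet", "children"], drop_cols=["salary"]))
```

With drop_cols listing 'salary', no column remains for the default transformer, which matches the README example where only 'pet' and 'children' contribute to the output.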
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore'
old/sklearn-pandas-1.8.0/sklearn_pandas/features_generator.py
new/sklearn-pandas-2.0.2/sklearn_pandas/features_generator.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/features_generator.py
2017-10-22 19:58:20.000000000 +0200
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/features_generator.py
2020-09-07 03:30:35.000000000 +0200
@@ -1,4 +1,4 @@
-def gen_features(columns, classes=None):
+def gen_features(columns, classes=None, prefix='', suffix=''):
"""Generates a feature definition list which can be passed
into DataFrameMapper
@@ -25,6 +25,10 @@
If None value selected, then each feature left as is.
+ prefix add prefix to transformed column names
+
+ suffix add suffix to transformed column names.
+
"""
if classes is None:
return [(column, None) for column in columns]
@@ -34,9 +38,15 @@
for column in columns:
feature_transformers = []
+ arguments = {}
+ if prefix and prefix != "":
+ arguments['prefix'] = prefix
+ if suffix and suffix != "":
+ arguments['suffix'] = suffix
+
classes = [cls for cls in classes if cls is not None]
if not classes:
- feature_defs.append((column, None))
+ feature_defs.append((column, None, arguments))
else:
for definition in classes:
@@ -50,6 +60,6 @@
if not feature_transformers:
feature_transformers = None
- feature_defs.append((column, feature_transformers))
+ feature_defs.append((column, feature_transformers, arguments))
return feature_defs
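The effect of the gen_features change is that every feature definition now carries a third element, an options dictionary that holds prefix and/or suffix only when they are non-empty. A simplified sketch, assuming transformers are given as zero-argument factories: sketch_gen_features, feature_options and the Encoder placeholder class are hypothetical names used for illustration, not the library's code.

```python
class Encoder:
    """Placeholder transformer class, used only for illustration."""


def feature_options(prefix="", suffix=""):
    """Build the per-feature options dict; empty values are left out."""
    arguments = {}
    if prefix:
        arguments["prefix"] = prefix
    if suffix:
        arguments["suffix"] = suffix
    return arguments


def sketch_gen_features(columns, factories, prefix="", suffix=""):
    """Return (column, transformers, options) triples, one per column."""
    feature_defs = []
    for column in columns:
        # Each column gets fresh transformer instances and its own dict.
        transformers = [factory() for factory in factories] or None
        feature_defs.append(
            (column, transformers, feature_options(prefix, suffix))
        )
    return feature_defs


defs = sketch_gen_features(["col1", "col2"], [Encoder], prefix="lblencoder_")
for column, transformers, options in defs:
    print(column, len(transformers), options)
```

When neither prefix nor suffix is given, the third element is an empty dictionary, consistent with the updated README output that shows `{}` in each tuple.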
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn'
'--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas/transformers.py
new/sklearn-pandas-2.0.2/sklearn_pandas/transformers.py
--- old/sklearn-pandas-1.8.0/sklearn_pandas/transformers.py 2018-12-01
20:13:29.000000000 +0100
+++ new/sklearn-pandas-2.0.2/sklearn_pandas/transformers.py 2020-09-07
03:30:35.000000000 +0200
@@ -1,8 +1,6 @@
import numpy as np
import pandas as pd
-
-from sklearn.base import BaseEstimator, TransformerMixin
-from sklearn.utils.validation import check_is_fitted
+from sklearn.base import TransformerMixin
def _get_mask(X, value):
@@ -17,136 +15,33 @@
return X == value
-class CategoricalImputer(BaseEstimator, TransformerMixin):
+class NumericalTransformer(TransformerMixin):
"""
- Impute missing values from a categorical/string np.ndarray or pd.Series
- with the most frequent value on the training data.
-
- Parameters
- ----------
- missing_values : string or "NaN", optional (default="NaN")
- The placeholder for the missing values. All occurrences of
- `missing_values` will be imputed. None and np.nan are treated
- as being the same, use the string value "NaN" for them.
-
- copy : boolean, optional (default=True)
- If True, a copy of X will be created.
-
- strategy : string, optional (default = 'most_frequent')
- The imputation strategy.
-
- - If "most_frequent", then replace missing using the most frequent
- value along each column. Can be used with strings or numeric data.
- - If "constant", then replace missing values with fill_value. Can be
- used with strings or numeric data.
-
- fill_value : string, optional (default='?')
- The value that all instances of `missing_values` are replaced
- with if `strategy` is set to `constant`. This is useful if
- you don't want to impute with the mode, or if there are multiple
- modes in your data and you want to choose a particular one. If
- `strategy` is not set to `constant`, this parameter is ignored.
-
- Attributes
- ----------
- fill_ : str
- The imputation fill value
-
+ Provides commonly used numerical transformers.
"""
+ SUPPORTED_FUNCTIONS = ['log', 'log1p']
- def __init__(
- self,
- missing_values='NaN',
- strategy='most_frequent',
- fill_value='?',
- copy=True
- ):
- self.missing_values = missing_values
- self.copy = copy
- self.fill_value = fill_value
- self.strategy = strategy
-
- strategies = ['constant', 'most_frequent']
- if self.strategy not in strategies:
- raise ValueError(
- 'Strategy {0} not in {1}'.format(self.strategy, strategies)
- )
-
- def fit(self, X, y=None):
- """
-
- Get the most frequent value.
-
- Parameters
- ----------
- X : np.ndarray or pd.Series
- Training data.
-
- y : Passthrough for ``Pipeline`` compatibility.
-
- Returns
- -------
- self: CategoricalImputer
- """
-
- mask = _get_mask(X, self.missing_values)
- X = X[~mask]
- if self.strategy == 'most_frequent':
- modes = pd.Series(X).mode()
- elif self.strategy == 'constant':
- modes = np.array([self.fill_value])
- if modes.shape[0] == 0:
- raise ValueError('Data is empty or all values are null')
- elif modes.shape[0] > 1:
- raise ValueError('No value is repeated more than '
- 'once in the column')
- else:
- self.fill_ = modes[0]
-
- return self
-
- def transform(self, X):
+ def __init__(self, func):
"""
+ Params
- Replaces missing values in the input data with the most frequent value
- of the training data.
-
- Parameters
- ----------
- X : np.ndarray or pd.Series
- Data with values to be imputed.
-
- Returns
- -------
- np.ndarray
- Data with imputed values.
+ func function to apply to input columns. The function will be
+ applied to each value. Supported functions are defined
+ in SUPPORTED_FUNCTIONS variable. Throws assertion error if the
+ not supported.
"""
-
- check_is_fitted(self, 'fill_')
-
- if self.copy:
- X = X.copy()
-
- mask = _get_mask(X, self.missing_values)
- X[mask] = self.fill_
-
- return np.asarray(X)
-
-
-class FunctionTransformer(BaseEstimator, TransformerMixin):
- """
- Use this class to convert a random function into a
- transformer.
- """
-
- def __init__(self, func):
+ assert func in self.SUPPORTED_FUNCTIONS, \
+ f"Only following func are supported: {self.SUPPORTED_FUNCTIONS}"
+ super(NumericalTransformer, self).__init__()
self.__func = func
- def fit(self, x, y=None):
+ def fit(self, X, y=None):
return self
- def transform(self, x):
- return np.vectorize(self.__func)(x)
+ def transform(self, X, y=None):
+ if self.__func == 'log1p':
+ return np.vectorize(np.log1p)(X)
+ elif self.__func == 'log':
+ return np.vectorize(np.log)(X)
- def __call__(self, *args, **kwargs):
- return self.__func(*args, **kwargs)
+ raise ValueError(f"Invalid function name: {self.__func}")
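Note on the hunk above: a rough standalone sketch of the behaviour the added NumericalTransformer appears to implement, namely a name-based choice between log and log1p applied element-wise. `LogTransformerSketch` is a hypothetical name. Unlike the class in the diff, it raises ValueError instead of using assert, and it calls the NumPy ufuncs directly rather than through np.vectorize. Both are assumptions made for brevity and are expected to give the same values for numeric input.

```python
import numpy as np


class LogTransformerSketch:
    """Hypothetical stand-in; not the class added by the diff."""

    # Map each supported name to the NumPy function it selects.
    SUPPORTED_FUNCTIONS = {'log': np.log, 'log1p': np.log1p}

    def __init__(self, func):
        if func not in self.SUPPORTED_FUNCTIONS:
            raise ValueError(
                f"Only following func are supported: "
                f"{sorted(self.SUPPORTED_FUNCTIONS)}"
            )
        self.func = func

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X, y=None):
        # Apply the selected function to every element.
        return self.SUPPORTED_FUNCTIONS[self.func](np.asarray(X, dtype=float))


X = [[0.0], [1.0]]
out = LogTransformerSketch('log1p').fit(X).transform(X)
print(out.shape)  # (2, 1)
```

The values are log1p(0) = 0 and log1p(1) = ln 2. An unsupported name such as 'sqrt' is rejected at construction time, which mirrors the intent of the assertion in the diff.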
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/PKG-INFO new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/PKG-INFO
--- old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/PKG-INFO 2018-12-01 20:14:57.000000000 +0100
+++ new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/PKG-INFO 2020-10-01 22:54:52.000000000 +0200
@@ -1,12 +1,11 @@
-Metadata-Version: 1.0
+Metadata-Version: 1.2
Name: sklearn-pandas
-Version: 1.8.0
+Version: 2.0.2
Summary: Pandas integration with sklearn
-Home-page: https://github.com/paulgb/sklearn-pandas
-Author: Israel Saeta Pérez
-Author-email: [email protected]
+Home-page: https://github.com/scikit-learn-contrib/sklearn-pandas
+Maintainer: Ritesh Agrawal
+Maintainer-email: [email protected]
License: UNKNOWN
-Description-Content-Type: UNKNOWN
Description: UNKNOWN
Keywords: scikit,sklearn,pandas
Platform: UNKNOWN
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/SOURCES.txt new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/SOURCES.txt
--- old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/SOURCES.txt 2018-12-01 20:14:57.000000000 +0100
+++ new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/SOURCES.txt 2020-10-01 22:54:52.000000000 +0200
@@ -4,7 +4,6 @@
setup.cfg
setup.py
sklearn_pandas/__init__.py
-sklearn_pandas/categorical_imputer.py
sklearn_pandas/cross_validation.py
sklearn_pandas/dataframe_mapper.py
sklearn_pandas/features_generator.py
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/requires.txt new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/requires.txt
--- old/sklearn-pandas-1.8.0/sklearn_pandas.egg-info/requires.txt 2018-12-01 20:14:57.000000000 +0100
+++ new/sklearn-pandas-2.0.2/sklearn_pandas.egg-info/requires.txt 2020-10-01 22:54:52.000000000 +0200
@@ -1,4 +1,4 @@
-scikit-learn>=0.15.0
-scipy>=0.14
-pandas>=0.11.0
-numpy>=1.6.1
+scikit-learn>=0.23.0
+scipy>=1.4.1
+pandas>=1.0.5
+numpy>=1.18.1