Hi List,
I was counting the number of non-zero coefficients of an SGDClassifier and
got a very strange ValueError when calling predict() again afterwards.
After some research, it seems that our own sklearn.utils.fixes.count_nonzero
has a side effect that changes the type of the coef_ matrix. Any further
call to densify/sparsify/predict then fails with this exception.
A similar bug seems to be pending here:
https://github.com/scikit-learn/scikit-learn/issues/1968 ; my code
breaks with several scipy versions (0.11, 0.12, 0.13, 0.13.1), numpy 1.8.0 and
sklearn git/HEAD.
Here is the traceback:
"""
Traceback (most recent call last):
  File "/home/ediemert/sources/oss/scikit-learn/examples/applications/sgd_elasticnet_bug.py", line 22, in <module>
    clf.predict(X_test)
  File "/home/ediemert/sources/oss/scikit-learn/sklearn/linear_model/base.py", line 223, in predict
    scores = self.decision_function(X)
  File "/home/ediemert/sources/oss/scikit-learn/sklearn/linear_model/base.py", line 199, in decision_function
    X = atleast2d_or_csr(X)
  File "/home/ediemert/sources/oss/scikit-learn/sklearn/utils/validation.py", line 148, in atleast2d_or_csr
    "tocsr", force_all_finite)
  File "/home/ediemert/sources/oss/scikit-learn/sklearn/utils/validation.py", line 123, in _atleast2d_or_sparse
    _assert_all_finite(X.data)
  File "/home/ediemert/sources/oss/scikit-learn/sklearn/utils/validation.py", line 39, in _assert_all_finite
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py", line 25, in _sum
    out=out, keepdims=keepdims)
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 183, in __bool__
    raise ValueError("The truth value of an array with more than one "
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
"""
And a minimal script to reproduce the bug:
"""
import numpy as np
from scipy.sparse import csr_matrix
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.utils import shuffle
from sklearn.utils.fixes import count_nonzero

np.random.seed(0)

# Load the vectorized 20 newsgroups data and split it in half
twentyng_data = datasets.fetch_20newsgroups_vectorized(subset='all')
X, y = shuffle(twentyng_data.data, twentyng_data.target)
offset = int(X.shape[0] * 0.5)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
X_test = csr_matrix(X_test)
X_train = csr_matrix(X_train)

# Fit, then convert coef_ to a scipy.sparse matrix
clf = SGDClassifier()
clf.fit(X_train, y_train)
clf.sparsify()

# uncomment next line to trigger the buggy behavior
#count_nonzero(clf.coef_)
clf.predict(X_test)
"""
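As a possible stopgap, the count itself could be done without going through
sklearn.utils.fixes.count_nonzero. The following is only a sketch:
n_nonzero_coefs is a hypothetical helper name (not part of scikit-learn), and
it assumes coef_ is either a NumPy array or a scipy.sparse matrix. It reads
the values without modifying the object it is given, and it does not address
the underlying side effect.

```python
import numpy as np
import scipy.sparse as sp


def n_nonzero_coefs(coef):
    """Count non-zero entries of a dense array or a scipy.sparse matrix.

    Hypothetical helper for illustration; the input is only read.
    """
    if sp.issparse(coef):
        # .data holds the explicitly stored values; some of them may be
        # zero, so count the non-zero ones instead of relying on .nnz
        return int(np.count_nonzero(coef.tocsr().data))
    return int(np.count_nonzero(np.asarray(coef)))


dense = np.array([[0.0, 1.5, 0.0],
                  [2.0, 0.0, -3.0]])
sparse = sp.csr_matrix(dense)
print(n_nonzero_coefs(dense), n_nonzero_coefs(sparse))
```

In the script above this would be called as n_nonzero_coefs(clf.coef_) in
place of count_nonzero(clf.coef_); it is untested against the scipy versions
listed earlier.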
Any ideas welcome!
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general