Hi all, I was playing around with KFold CV and found that I need to convert X
(a scipy sparse matrix produced by text vectorization) with todense() in order
to use it with KFold CV in the following code:
----
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
----
Question 1: instead of using X_train, X_test = X[train_index],
X[test_index] on the dense X obtained from todense(), is there any
other way to use KFold CV on a scipy sparse matrix for text classification
without calling todense() on it?
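
As a point of reference, SciPy's CSR format supports row selection with an integer index array, so a sketch along the following lines keeps every fold sparse. It is only an illustration under stated assumptions: the toy X and y are made up, X is assumed to be in CSR format (other sparse formats may need X.tocsr() first), and it assumes the newer sklearn.model_selection.KFold API with kf.split(X), whereas the loop above iterates over kf directly as in older releases.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import KFold

# Toy stand-in for a vectorized text corpus: 6 documents, 4 features.
# (Hypothetical data, only for illustration.)
X = sparse.csr_matrix(np.array([
    [1, 0, 0, 2],
    [0, 3, 0, 0],
    [0, 0, 4, 0],
    [5, 0, 0, 0],
    [0, 6, 0, 1],
    [0, 0, 7, 0],
]))
y = np.array([0, 1, 0, 1, 0, 1])

kf = KFold(n_splits=3)
fold_shapes = []
for train_index, test_index in kf.split(X):
    # CSR matrices accept integer index arrays for row selection,
    # so no todense() call is needed here.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Both slices are still scipy sparse matrices.
    assert sparse.issparse(X_train) and sparse.issparse(X_test)
    fold_shapes.append((X_train.shape, X_test.shape))

print(fold_shapes)
```

If the matrix is stored in a format without row indexing (COO, for example), converting once with X.tocsr() before the loop should be enough.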

Continuing:
Without having found an alternative, I kept exploring KFold CV using the
dense X obtained from todense().

To apply classifiers to the dense X, I tried both MultinomialNB
and BernoulliNB along with others, such as LinearSVC, KNeighborsClassifier,
and RidgeClassifier.
However, when working on the dense X obtained by calling todense() on the
scipy sparse text matrix, both Naive Bayes classifiers fail with the error
shown below. The problem can be reproduced simply by adding two lines that
call todense() on X_test and X_train in the
example<http://scikit-learn.sourceforge.net/stable/auto_examples/document_classification_20newsgroups.html#example-document-classification-20newsgroups-py>
.
Why do only the NB classifiers raise an error here? I cannot tell much from
the error message I got. I would like to learn more. Thanks for your kind help!
-------
Traceback (most recent call last):
  File "testCV.py", line 232, in <module>
    mnnb_results = benchmark(MultinomialNB(alpha=.01))
  File "testCV.py", line 166, in benchmark
    score = metrics.f1_score(y_test, pred)
  File "xxx\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 373, in f1_score
    return fbeta_score(y_true, y_pred, 1, pos_label=pos_label)
  File "xxx\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 326, in fbeta_score
    _, _, f, s = precision_recall_fscore_support(y_true, y_pred, beta=beta)
  File "xxx\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 420, in precision_recall_fscore_support
    y_true, y_pred = check_arrays(y_true, y_pred)
  File "xxx\Python27\lib\site-packages\sklearn\utils\__init__.py", line 131, in check_arrays
    size, n_samples))
ValueError: Found array with dim 1. Expected 895
-----
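
One thing that may be worth checking here, offered as a guess rather than a confirmed diagnosis of the traceback above: on a scipy sparse matrix, todense() returns a numpy.matrix, which stays 2-D under reductions such as argmax, while toarray() returns a plain numpy.ndarray. A classifier that takes an argmax over per-class scores could therefore hand back predictions with an unexpected shape when fed a numpy.matrix. The small sketch below only demonstrates that shape difference on made-up data; it does not reproduce the benchmark script.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x3 sparse matrix standing in for per-class scores or features.
X = sparse.csr_matrix(np.array([[0.0, 2.0, 1.0],
                                [3.0, 0.0, 0.0],
                                [0.0, 1.0, 4.0]]))

dense_matrix = X.todense()   # numpy.matrix (always 2-D)
dense_array = X.toarray()    # numpy.ndarray

# Row-wise argmax: the matrix result keeps a second dimension,
# the ndarray result is flat.
matrix_argmax = dense_matrix.argmax(axis=1)
array_argmax = dense_array.argmax(axis=1)

print(type(dense_matrix).__name__, matrix_argmax.shape)
print(type(dense_array).__name__, array_argmax.shape)
```

If this does turn out to be related, swapping todense() for toarray() in the two added lines would be a cheap experiment.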
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
