Thanks, I removed the unnecessary fit call.

Regarding normalization, aren't the features already normalized with the l2 norm automatically when using tf-idf?
      vectorizer = TfidfVectorizer()
      X = vectorizer.fit_transform(trainingTexts)

Just in case, I added the following but get the same results anyway:
      from sklearn import preprocessing
      normalizer = preprocessing.Normalizer().fit(X)
      X = normalizer.transform(X)
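
A minimal sketch of why the extra Normalizer step gives the same results: with its default settings TfidfVectorizer uses norm="l2", so every row already has unit Euclidean length. The toy corpus and the name training_texts are made up for illustration and are not the data from the zip.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus standing in for trainingTexts.
training_texts = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()  # norm="l2" is the default
X = vectorizer.fit_transform(training_texts)

# Euclidean length of each row of the sparse tf-idf matrix.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()
print(row_norms)

# Applying an l2 Normalizer on top should leave the matrix unchanged.
X_again = Normalizer(norm="l2").fit_transform(X)
max_diff = np.abs((X - X_again).toarray()).max()
print(max_diff)
```

If the vectorizer were created with norm=None, the second step would no longer be a no-op.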

However, removing all the parameters from the vectorizer, leaving just
      vectorizer = TfidfVectorizer()
now gives me better (but still bad) output for LinearRegression:
      Accuracy: 0.40 (+/- 0.05)

I've added these small changes to the zip that I uploaded here:
https://dl.dropbox.com/u/74279156/regression.zip

No idea what to try next...

Zach


On 12/08/2012 02:52, Andreas Mueller wrote:
Just a small comment:
You don't need to `fit` the models before using `cross_val_score`.
They are refit from scratch for each split.
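
A small sketch of that point, on synthetic data rather than the texts from the thread. It assumes a current scikit-learn release, where the function lives in sklearn.model_selection (in 2012 it was in sklearn.cross_validation). The estimator is passed in unfitted; cross_val_score works on a clone per split, so the original object stays untouched. For regressors the default score is R^2, not accuracy, which is why values can be negative when a model does worse than predicting the mean.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic, nearly linear data (made up for illustration).
rng = np.random.RandomState(0)
X = rng.rand(50, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.randn(50)

model = LinearRegression()  # deliberately never fitted by us

# Each of the 5 splits fits a fresh clone of the model on its training part.
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # one R^2 value per split
print(scores.mean())

# The object we passed in was only cloned, so it is still unfitted.
is_fitted = hasattr(model, "coef_")
print(is_fitted)
```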

Btw, have you tried normalizing your responses?
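
One possible reading of "normalizing your responses" is standardizing the targets to zero mean and unit variance, sketched below. The ratings are invented and the helper to_original_scale is hypothetical. This can matter for SVR because its epsilon and C parameters are expressed on the scale of y. In a real cross-validation the mean and standard deviation should come from the training part of each split only.

```python
import numpy as np

# Hypothetical human ratings standing in for the responses in the thread.
y = np.array([55.0, 70.0, 62.0, 90.0, 48.0])

# Standardize: zero mean, unit variance.
y_mean, y_std = y.mean(), y.std()
y_scaled = (y - y_mean) / y_std

# A model would be trained on y_scaled; its predictions are then
# mapped back to the original rating scale.
def to_original_scale(pred):
    return pred * y_std + y_mean

restored = to_original_scale(y_scaled)
print(y_scaled)
print(restored)
```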

Cheers,
Andy


On 08/12/2012 07:02 AM, Zach Bastick wrote:
Sorry about that, the RTF reader is from the Pyth library:
http://pypi.python.org/pypi/pyth/

I think that's all that's needed.

Thanks for taking a look!

Zach


On 11/08/2012 22:55, Robert Layton wrote:
On 12 August 2012 15:35, Zach Bastick wrote:

    I have tried various machine learning algorithms from scikit-learn but
    can't find a good prediction model.
    The features I'm using are the tf-idf of a set of text documents,
    paired with human ratings assigned to each document. I'm thinking
    that I must be doing something wrong as the scores can't be that bad
    (not to mention negative?)

    If someone could have a look at it, I'd really appreciate it. I didn't
    upload to a GitHub gist because they won't let me upload the dataset
    directory. So I've uploaded my really short code (regression.py) AND the
    original data set (/texts) here (625K):
    https://dl.dropbox.com/u/74279156/regression.zip

    This is my output:
    C:\python code\program>python regression.py
    loading texts...
    n_samples: 53, n_features: 6284

    LinearRegresson
    [ 0.34662496  0.23446674  0.30332109  0.3163838 0.01607913]
    Accuracy: 0.24 (+/- 0.06)

    SVR linear
    [-0.05521329 -1.61280714 -0.67428098 -0.8805647  -2.20730703]
    Accuracy: -1.09 (+/- 0.37)

    SVR poly 4 degrees
    [-0.18814233 -1.78480475 -0.88158686 -1.05944432 -2.40284073]
    Accuracy: -1.26 (+/- 0.38)

    SVR sigmoid
    [-0.18814233 -1.78480475 -0.88158686 -1.05944432 -2.40284073]
    Accuracy: -1.26 (+/- 0.38)


    Please tell me what's wrong. I'm dying to know how to get scikit-learn
    to predict based on this dataset.

    Thanks

    Zach

    
------------------------------------------------------------------------------
    Live Security Virtual Conference
    Exclusive live event will cover all the ways today's security and
    threat landscape has changed and how IT managers can respond. Discussions
    will include endpoint security, mobile security and the latest in malware
    threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
    _______________________________________________
    Scikit-learn-general mailing list
    [email protected]
    <mailto:[email protected]>
    https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



Where can I get the plugins from?
--

Public key at: http://pgp.mit.edu/ Search for this email address and select the key from "2011-08-19" (key id: 54BA8735)



