Hello everyone,

I'm new to the list, so first of all: thanks a lot for your work on
this library!

I need libsvm probability estimates as well as Logistic Regression (LR)
for a three-class problem with a training set of about 5,000-6,000
samples and 20-50 features. I am familiar with Python and Octave (on
the math side even more with Octave), but I would prefer Python, since
I need a lot of surrounding programming, which can be tedious in
Octave...

Reading a lot of posts in various discussions, scikit-learn seems to
offer the most advanced and best-documented Python bindings for libsvm,
but I also found the following site:
http://fseoane.net/blog/2010/fast-bindings-for-libsvm-in-scikitslearn

He writes that his bindings are implemented in scikit-learn, but also
that the code is in alpha status; that was three years ago. Anyway, I
started with a simple problem: 65 data points with 2 features each.

Questions:

1) Playing around with SVM probability estimates, it "seems" to work
nicely (https://dl.dropboxusercontent.com/u/95888530/svm_2.png). I just
wanted to ask how stable the Python bindings are with respect to the
alpha-status remark on the website mentioned above.
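
For reference, this is roughly how I get the estimates (a minimal
sketch with random stand-in data; the kernel and C are just the values
I happened to try):

import numpy as np
from sklearn.svm import SVC

# stand-in data: 65 points, 2 features, 3 classes, like my toy problem
rng = np.random.RandomState(0)
X = rng.randn(65, 2)
y = rng.randint(0, 3, 65)

# probability=True makes libsvm fit Platt scaling internally,
# so predict_proba becomes available after fitting
clf = SVC(kernel='rbf', C=1.0, probability=True)
clf.fit(X, y)
proba = clf.predict_proba(X)  # shape (65, 3), one column per class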

2) Playing around with LR, the results "look interesting"
(https://dl.dropboxusercontent.com/u/95888530/logreg_1.png), but I was
not able to get a model that adapts to/"overfits" every single data
point, as in the SVM example plot (I tried very large C). I took the
first ML online class with Andrew Ng; there we implemented LR
ourselves, but the feature creation from the raw features was ad hoc
(from x and y to x^2, y^2, x*y, x*y^2 and so on). I followed the same
feature mapping here, ending up with 28 features out of 2. Fitting the
model takes about 15-17 seconds (on my simple example).
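
This is the mapping I used (a sketch, assuming the same degree-6
polynomial expansion as in the course exercise, which gives exactly 28
columns from 2 features; X and y are the toy data from above):

import numpy as np
from sklearn.linear_model import LogisticRegression

def map_feature(x1, x2, degree=6):
    # all monomials x1^(i-j) * x2^j up to total degree 6,
    # plus a constant column: 28 features from 2
    cols = [np.ones_like(x1)]
    for i in range(1, degree + 1):
        for j in range(i + 1):
            cols.append((x1 ** (i - j)) * (x2 ** j))
    return np.column_stack(cols)

X_mapped = map_feature(X[:, 0], X[:, 1])
lr = LogisticRegression(C=1e6)  # very large C, i.e. almost no regularization
lr.fit(X_mapped, y)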

I know feature selection/extraction is a big research topic in itself,
but maybe scikit-learn can help me here without my having to read a
dozen papers, or maybe there are some rules of thumb. Is there any
method in scikit-learn that could help me find a feature mapping? I
guess RandomizedLogisticRegression could help somehow, but I didn't
really get the point: as far as I understand, I still have to provide
the candidate features myself, and it only helps me pick the best ones
by fitting on random subsamples (see my attempt below). On my real data
set, mapping the 20-50 features to higher-dimensional spaces and trying
things out would probably take too long, considering the 15 seconds
needed for a single model on the simple example (and that's before
searching for the optimal regularization C). Any suggestions?
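
If I read the docs correctly, usage would look something like this (a
sketch reusing X_mapped and y from above; the parameter values are the
documented defaults, and I may well be misunderstanding what the method
is for, so corrections welcome):

from sklearn.linear_model import RandomizedLogisticRegression

# my understanding of stability selection: L1-penalized LR is fit on
# many random subsamples of the data, and scores_ records how often
# each feature ends up with a non-zero weight
rlr = RandomizedLogisticRegression(C=1.0, n_resampling=200,
                                   selection_threshold=0.25)
rlr.fit(X_mapped, y)
print(rlr.scores_)                    # per-feature selection frequency
X_selected = rlr.transform(X_mapped)  # keep features above the threshold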


Cheers!

Richard
