Hello everyone, I'm new to the list, so first of all thanks a lot for your work on this lib!
I need libsvm probability estimates as well as Logistic Regression (LR) for a three-class problem with a training set of about 5000-6000 samples and 20-50 features. I am familiar with Python and Octave (on the math side even more with Octave), but I would prefer Python, since I need all the surrounding programming, which can be tedious in Octave. Reading lots of posts and discussions, scikit-learn seems to offer the most advanced and best documented Python binding for libsvm, but I also found this site: http://fseoane.net/blog/2010/fast-bindings-for-libsvm-in-scikitslearn He writes that his bindings are implemented in scikit-learn, but also that the code is in alpha status; that was three years ago.

Well, I started with a simple problem: 65 data points with 2 features each. Questions:

1) Playing around with SVM probabilities, it "seems" to work nicely (https://dl.dropboxusercontent.com/u/95888530/svm_2.png). I just wanted to ask how stable the Python binding is with regard to the blog post mentioned above.

2) Playing around with LR, the results "look interesting" (https://dl.dropboxusercontent.com/u/95888530/logreg_1.png), but I was not able to get a model that adapts to ("overfits") every single data point, as in the SVM example plot, even with very large C. I did the first ML online class with Andrew Ng, where we implemented LR ourselves, but the feature creation from the raw features was ad hoc (from x and y to x^2, y^2, x*y, x*y^2 and so on). I followed the same feature mapping here, ending up with 28 features out of 2. It takes about 15-17 seconds to fit the model (on my simple example). A rough sketch of what I'm doing is pasted below my sign-off.

I know feature selection/extraction is a big research topic in itself, but maybe scikit-learn can help me here without my having to read a dozen papers, or maybe there are some rules of thumb. So, is there any method within scikit-learn that could help me find a feature mapping? I guess RandomizedLogisticRegression could help me somehow, but I didn't really get the point: I think here I again have to provide the features myself, and it will just help me find the best ones by random trials? On my real data set, mapping the 20-50 features to higher-dimensional spaces and trying things out would probably take too long, given the 15 seconds needed for a single model on the simple example (and that is before even searching for the optimal regularization C).

Any suggestions?

Cheers!
Richard
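
P.S. For reference, here is a minimal sketch of the kind of thing I am doing (not my exact code; the data below is random toy data just so the snippet runs, and the degree-6 mapping is the one from the ML class that gives 28 features out of 2):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression

    def map_features(x1, x2, degree=6):
        """Ad-hoc polynomial mapping: 1, x1, x2, x1^2, x1*x2, x2^2, ..., x2^degree."""
        cols = [np.ones_like(x1)]                      # bias column
        for i in range(1, degree + 1):
            for j in range(i + 1):
                cols.append((x1 ** (i - j)) * (x2 ** j))
        return np.column_stack(cols)                   # degree=6 -> 28 columns

    # random stand-in for my 65 points / 2 features / 3 classes
    rng = np.random.RandomState(0)
    X = rng.randn(65, 2)
    y = rng.randint(0, 3, 65)

    # 1) SVM with probability estimates (the part that "seems" to work)
    svm = SVC(kernel='rbf', C=1.0, probability=True).fit(X, y)
    print(svm.predict_proba(X[:5]))

    # 2) LR on the hand-mapped features, large C trying to overfit every point
    X_mapped = map_features(X[:, 0], X[:, 1])
    lr = LogisticRegression(C=1e5, fit_intercept=False).fit(X_mapped, y)  # bias already in X_mapped
    print(lr.score(X_mapped, y))

The question about a scikit-learn alternative to this hand-rolled map_features (and a faster way to search C) is basically what I'm asking above.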