Re: [Scikit-learn-general] Scikit-learn vs Weka on Logistic Regression

Osman Başkaya Thu, 24 Jan 2013 11:29:31 -0800

>
> Even though no parameter optimization was done for Weka, it looks
> significantly better for these data. Is there something I missed?



I am correcting my conclusion:

Even though no parameter optimization was done for Weka, it looks
significantly better for the first three words. Is there something I missed?

I am also trying to understand whether Weka does some pre-processing (such
as mean centering, grid search, etc.). Does anyone know?
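
One way to probe the pre-processing question on the scikit-learn side is to
grid search C with and without standardization and compare the
cross-validated accuracy. The following is only a minimal sketch, not the
code from this thread: the synthetic data, the C grid, and the helper name
best_cv_accuracy are assumptions for illustration, and the import paths are
those of current scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for one word's data (100 instances, 3 senses).
X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
# Make every row non-negative and sum to 1, like the features described above.
X = np.abs(X)
X = X / X.sum(axis=1, keepdims=True)

# 10-fold stratified cross-validation, as in the experiments in this thread.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# Assumption: an arbitrary logarithmic grid for the regularization parameter C.
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100, 1000]}


def best_cv_accuracy(steps):
    """Grid search C for the given pipeline steps; return (best C, best accuracy)."""
    search = GridSearchCV(Pipeline(steps), param_grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    return search.best_params_["clf__C"], search.best_score_


# Raw features versus standardized (zero mean, unit variance) features.
raw_C, raw_score = best_cv_accuracy(
    [("clf", LogisticRegression(max_iter=2000))])
scaled_C, scaled_score = best_cv_accuracy(
    [("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=2000))])

print("raw features:    best C = %g, accuracy = %.3f" % (raw_C, raw_score))
print("scaled features: best C = %g, accuracy = %.3f" % (scaled_C, scaled_score))
```

If the two accuracies differ a lot on the real data, pre-processing is a
plausible source of the gap between the two toolkits; if they do not, the
difference more likely comes from regularization strength or from the
evaluation setup.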

Sorry for the double posting.

On Thu, Jan 24, 2013 at 9:23 PM, Osman Başkaya
<osman.bask...@computer.org> wrote:

> Dear Olivier and Gael,
>
> Thank you guys.
>
> Olivier,
>
>> You mean each feature vector sums to 1, right?
>
>
> Yes, and every value lies between 0 and 1. These are actually
> probability distributions.
>
>> You should never use the default settings of a classifier to compare
>> scores. Always grid search the optimal values of the most impacting
>> hyperparameters. In the case of LogisticRegression you should grid
>> search the regularization parameter which is named 'C'.
>> Here is the documentation for grid search:
>>   http://scikit-learn.org/dev/modules/grid_search.html
>
>
> While waiting for the answers, I tried changing the regularization
> parameter (C), and accuracy increased.
>
> I then ran a grid search to choose the best parameters for Logistic
> Regression. The scores are:
>
>
>    WORD          Scikit (before grid)   Scikit (after grid)   Weka
>
>    accommodate   0.3                    0.47                  0.667
>    bow           0.05                   0.35                  0.681818
>    display       0.475                  0.525                 0.70
>    haunt         0.575                  0.78                  0.53
>    owe           0.2533                 0.55                  0.4375
>
>
> Note that I haven't done any grid search for Weka yet. I am not familiar
> with Weka, but once I pick the right parameter(s) for Weka's Logistic, I
> will share my results.
>
> Even though no parameter optimization was done for Weka, it looks
> significantly better for these data. Is there something I missed?
>
> My new code snippet here: http://pastebin.com/A6xPYVH1
>
>
>
> On Thu, Jan 24, 2013 at 8:21 PM, Olivier Grisel
> <olivier.gri...@ensta.org> wrote:
>
>> 2013/1/24 O. B. <thyme....@gmail.com>:
>> > Sorry, I forgot to mention:
>> >
>> > Scikit's Logistic Regression is incredibly fast compared to Weka. Weka's
>> > implementation (mostly based on this paper) is slow as well as VERY
>> > memory intensive. Sometimes allocating 3 GB of heap was not enough. My
>> > dataset is very small (none of the words above has more than 100
>> > instances) because I run LR word by word.
>> >
>> > Is this because scikit's LR uses the liblinear library?
>> >
>> > Thank you
>> >
>> > On Thu, Jan 24, 2013 at 5:25 PM, O. B. <thyme....@gmail.com> wrote:
>> >>
>> >> Hello all,
>> >>
>> >> I have a problem with my experiments. I used Logistic Regression (LR)
>> >> to classify word senses. We have gold tags (the target set) for each
>> >> word instance.
>> >>
>> >> I did 10-fold cross-validation. Some words in my dataset have more than
>> >> two senses, so I wrapped logistic regression with OneVsRestClassifier.
>>
>> You don't need to wrap LogisticRegression in a OneVsRestClassifier
>> object as it's already using OvR / OvA for handling multiclass
>> internally as explained in the doc:
>>
>> http://scikit-learn.org/dev/modules/multiclass.html
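
As a small illustration of the point above, LogisticRegression can be fit
directly on labels with more than two classes, without a
OneVsRestClassifier wrapper. This is only a sketch on synthetic data (an
assumption, not the word-sense data from this thread). Which multiclass
strategy is used internally depends on the scikit-learn version and solver,
so the sketch shows only that the unwrapped estimator accepts three classes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumption: synthetic 3-class data standing in for a word with three senses.
X, y = make_classification(n_samples=120, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# No OneVsRestClassifier wrapper: the estimator handles multiclass labels itself.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

print(clf.classes_)        # the distinct class labels seen during fit
print(clf.coef_.shape)     # one row of coefficients per class
print(clf.predict(X[:5]))  # predicted labels for the first five instances
```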
>>
>> >> The code is here. Accuracy was not impressive, so I suspected there
>> >> was an error in my code. I therefore picked five words to classify
>> >> using LR in Weka. I used the default settings in Weka
>>
>> You should never use the default settings of a classifier to compare
>> scores. Always grid search the optimal values of the most impacting
>> hyperparameters. In the case of LogisticRegression you should grid
>> search the regularization parameter which is named 'C'.
>>
>> Here is the documentation for grid search:
>>
>>   http://scikit-learn.org/dev/modules/grid_search.html
>>
>> >> and these are the results:
>> >>
>> >> WORD          Scikit    Weka
>> >>
>> >> accommodate   0.3       0.667
>> >> bow           0.05      0.681818
>> >> display       0.475     0.70
>> >> haunt         0.575     0.53
>> >> owe           0.2533    0.4375
>> >>
>> >> These are the (correct_label / total_label) scores. Except for haunt,
>> >> the scores are not consistent, and scikit's are significantly lower
>> >> than Weka's. I am not saying scikit has a bug; most likely there is a
>> >> problem in my code, or Weka does some pre-processing instead of using
>> >> the raw data directly. Could you explain why there is such a large
>> >> difference between the Scikit and Weka scores?
>> >>
>> >> All features sum to 1 and their values are between 0 and 1.
>>
>> You mean each feature vector sums to 1, right?
>>
>> --
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>
>>
>> ------------------------------------------------------------------------------
>> Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS,
>> MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current
>> with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft
>> MVPs and experts. ON SALE this month only -- learn more at:
>> http://p.sf.net/sfu/learnnow-d2d
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
>
> --
> Osman Başkaya
> Koc University
> MS Student | Computer Science and Engineering
>



-- 
Osman Başkaya
Koc University
MS Student | Computer Science and Engineering

