Hi Pavel,

First of all, this is an interesting subject, thanks for bringing it
up! I fear it may be too domain-specific to go very deep in this
direction, though.
That said, trying to interpret your benchmarks, delta-idf does seem
interesting; or, more generally, the idea of class-aware idf. Does
delta-idf extend to multi-class settings?
It might be nice to have some sort of class-aware idf (I'm not aware
of existing options; do you know of any others?).
I'm not convinced about the BM25 variants, though, and the fact that
they performed best on one very specific task is not convincing
evidence of general-purpose usefulness.
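
To make concrete what I mean by class-aware idf, here is a minimal
NumPy sketch of one common formulation of delta-idf for the binary
case (the function name and the add-one smoothing are my own choices,
not taken from your gist; sign conventions also vary in the
literature, and only the magnitudes matter for a linear classifier):

```python
import numpy as np

def delta_idf(X, y, smooth=1.0):
    """Per-term delta-idf weights for binary labels y in {0, 1}.

    One common formulation:
        delta_idf(t) = log(N_pos / df_pos(t)) - log(N_neg / df_neg(t))
    with add-one smoothing on the per-class document frequencies.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    df_pos = (pos > 0).sum(axis=0) + smooth
    df_neg = (neg > 0).sum(axis=0) + smooth
    n_pos = pos.shape[0] + smooth
    n_neg = neg.shape[0] + smooth
    # Terms concentrated in one class get large-magnitude weights
    # (the sign encodes which class); evenly spread terms get ~0.
    return np.log(n_pos / df_pos) - np.log(n_neg / df_neg)
```

A multi-class extension would presumably need one such weight vector
per class (one-vs-rest), which is exactly the design question a PR
would have to settle.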

Secondly, your results are actually more realistic than those
Paltoglou & Thelwall report, precisely because you're not using
leave-one-out, which is known to underestimate the error [1] (more
discussion here, exactly in the context of the paper you cite [2]).
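
To illustrate on toy data (using a recent scikit-learn API, nothing
specific to your setup): each leave-one-out fold scores a single
sample, so the per-fold scores are all 0.0 or 1.0 and far noisier than
10-fold's:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Each LOO fold tests one sample, so each fold score is 0.0 or 1.0.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
kf_scores = cross_val_score(clf, X, y,
                            cv=KFold(10, shuffle=True, random_state=0))
print(loo_scores.std(), kf_scores.std())  # LOO fold scores are far noisier
```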

I would suggest you submit a PR with only the delta-idf variant (which
would simplify the code quite a bit). Make sure you add tests
verifying that it works well for multiclass (what to do for
multilabel?), and that it falls back or raises an informative error if
fit in an unsupervised setting.  Dealing with gists is painful since
we can't see exactly what you changed; besides, the GitHub issue
tracker is more appropriate than the mailing list for such specific
matters.
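
By "informative error" I mean something along these lines
(DeltaTfidfTransformer is a placeholder name, not a decided API):

```python
class DeltaTfidfTransformer:
    """Placeholder sketch; only the fit-time guard is shown."""

    def fit(self, X, y=None):
        if y is None:
            # A supervised weighting scheme cannot be fit without labels.
            raise ValueError(
                "delta-idf requires class labels; pass y to fit(), "
                "or use the plain TfidfTransformer instead."
            )
        self.classes_ = sorted(set(y))
        return self
```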

Cheers,
Vlad

[1] 
http://scikit-learn.org/stable/modules/cross_validation.html#leave-one-out-loo
[2] https://github.com/scikit-learn/scikit-learn/issues/1427

On Sat, Aug 23, 2014 at 3:44 PM, Pavel Soriano <sorianopa...@gmail.com> wrote:
> Greetings scikit,
>
> Last year I used the delta-idf and BM25 text weighting schemes with
> scikit-learn classifiers for an opinion classification task. Today I
> decided to clean them up and recheck them, in order to propose them as
> scikit-learn text feature extractors.
>
> I only implemented delta-idf, BM25 tf, and delta BM25 idf, as described
> in Paltoglou and Thelwall's paper "A study of Information Retrieval
> weighting schemes for sentiment analysis". I chose these because,
> according to their experiments, they are the best-performing schemes.
>
> I only tried Pang's movie reviews dataset. I do not reach their
> results, although I am not far from them; I did not feel like fitting
> 2000 models with leave-one-out (the validation scheme used in their
> paper).
>
> Introducing these schemes requires new parameters for TfidfTransformer
> and the vectorizers: namely a delta option (for binary classification)
> and a bm25 option, for using Okapi BM25 tf and/or idf. Moreover, BM25
> introduces new parameters, b and k, which I fixed to static values; of
> course this should (or maybe should not) be changed.
>
> I don't know if this would be helpful to anybody or if it was already
> discussed. That is why I am asking whether it is worth a pull request.
> Gist URL :
> https://gist.github.com/psorianom/0b9d8a742fe0efe0fe82
>
> Of course I am open to opinions.
>
> Thanks, have a good weekend!
>
> Pavel
>
> P.S: The results using different schemes (Pang's movie reviews balanced
> dataset) with stock LinearSVC:
> 1. BM25tf-idf
>     Accuracy: 0.852 (+/- 0.052)
>
> 2. Tf-idf Vanilla
>     Accuracy: 0.864 (+/- 0.081)
>
> 3.  Tf-Delta idf
>     Accuracy: 0.975 (+/- 0.046)
>
> 4. Tf-BM25idf
>     Accuracy: 0.744 (+/- 0.104)
>
> 5. Tf-Delta BM25idf
>     Accuracy: 0.949 (+/- 0.070)
>
> 6. BM25tf-Delta BM25idf
>     Accuracy: 0.943 (+/- 0.065)
>
>
> ------------------------------------------------------------------------------
> Slashdot TV.
> Video for Nerds.  Stuff that matters.
> http://tv.slashdot.org/
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
