Greetings scikit,
Last year I used the delta idf and BM25 text weighting schemes with scikit
classifiers for an opinion classification task. Today I decided to clean up
and recheck the code in order to propose it for the scikit-learn text
feature extractors.
I implemented only delta idf, BM25 tf, and delta BM25 idf, as described in
the Paltoglou and Thelwall paper "A study of Information Retrieval weighting
schemes for sentiment analysis". I chose these because, according to their
experiments, they are the best-performing schemes.
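
For readers who do not have the paper at hand, here is a minimal sketch of
the three weights in plain Python. It is not the code from the gist. The
function names, the 0.5 smoothing constant, the sign convention of the delta
weights, and the defaults k1=1.2 and b=0.75 are all assumptions made for
illustration; please check them against the paper before relying on them.

```python
import math


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 term frequency: saturating tf with length normalisation.

    k1 and b are illustrative defaults (assumption), not the gist's values.
    """
    if tf == 0:
        return 0.0
    # Longer-than-average documents get a larger normaliser, hence less weight.
    norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return (k1 + 1.0) * tf / (norm + tf)


def delta_idf(n_pos, df_pos, n_neg, df_neg, smooth=0.5):
    """Smoothed delta idf for two classes.

    n_pos, n_neg: number of documents in each class.
    df_pos, df_neg: number of documents in each class containing the term.
    With this sign convention (assumption), terms that are more frequent
    in the positive class get a negative weight.
    """
    return math.log((n_pos * df_neg + smooth) / (n_neg * df_pos + smooth))


def delta_bm25_idf(n_pos, df_pos, n_neg, df_neg, smooth=0.5):
    """Smoothed delta BM25 idf, same arguments and sign convention as above."""
    num = (n_pos - df_pos + smooth) * df_neg + smooth
    den = (n_neg - df_neg + smooth) * df_pos + smooth
    return math.log(num / den)
```

A term that appears equally often in both classes gets a delta weight of
zero under either variant, which is what makes the delta schemes ignore
words that carry no class information.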
I only tried them on Pang's movie reviews dataset. I do not reproduce their
results exactly, although I am not far off. Note that I did not feel like
fitting 2000 models with leave-one-out (the validation scheme used in their
paper), so the numbers are not directly comparable.
Introducing these schemes requires new parameters for TfidfTransformer and
the vectorizer: a *delta* option (for binary classification) and a *bm25*
option to select Okapi BM25 tf and/or idf. In addition, BM25 introduces two
new variables, *b* and *k*, which I fixed to static values. Of course, this
should (or maybe should not) be changed.
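
To make the parameter question concrete, here is a rough sketch of how such
switches could look on a transformer-like object. This is a hypothetical
stand-in written in plain Python, not scikit-learn's TfidfTransformer and not
the code from the gist: the class name, the parameter names delta, bm25_tf,
k1 and b, the default values, and the smoothing are all assumptions. The
main point is that the delta option needs the class labels at fit time.

```python
import math
from collections import Counter


class WeightingSketch:
    """Hypothetical illustration of delta / BM25 switches (not sklearn API)."""

    def __init__(self, delta=False, bm25_tf=False, k1=1.2, b=0.75):
        self.delta = delta      # use delta idf (binary labels required)
        self.bm25_tf = bm25_tf  # use Okapi BM25 tf instead of raw counts
        self.k1 = k1            # BM25 saturation parameter (assumed default)
        self.b = b              # BM25 length normalisation (assumed default)

    def fit(self, docs, labels=None):
        # docs: list of {term: count}; labels: 0/1 per document.
        if self.delta and labels is None:
            raise ValueError("delta weighting needs binary labels")
        n = len(docs)
        self.avg_len_ = sum(sum(d.values()) for d in docs) / n
        df = Counter(t for d in docs for t in d)  # document frequencies
        if self.delta:
            n_pos = sum(labels)
            n_neg = n - n_pos
            df_pos = Counter(
                t for d, y in zip(docs, labels) if y == 1 for t in d
            )
            # Smoothed delta idf; df[t] - df_pos[t] is the negative-class df.
            self.idf_ = {
                t: math.log(
                    (n_pos * (df[t] - df_pos[t]) + 0.5)
                    / (n_neg * df_pos[t] + 0.5)
                )
                for t in df
            }
        else:
            # Plain unsmoothed idf with a +1 offset (assumption).
            self.idf_ = {t: math.log(n / df[t]) + 1.0 for t in df}
        return self

    def transform(self, docs):
        rows = []
        for d in docs:
            length = sum(d.values())
            row = {}
            for t, tf in d.items():
                if t not in self.idf_:
                    continue  # term unseen during fit
                if self.bm25_tf:
                    norm = self.k1 * (
                        (1.0 - self.b) + self.b * length / self.avg_len_
                    )
                    tf = (self.k1 + 1.0) * tf / (norm + tf)
                row[t] = tf * self.idf_[t]
            rows.append(row)
        return rows


docs = [{"good": 2, "film": 1}, {"bad": 1, "film": 2}]
labels = [1, 0]
rows = WeightingSketch(delta=True, bm25_tf=True).fit(docs, labels).transform(docs)
```

In this toy example "film" occurs in both classes and ends up with zero
weight, while "good" and "bad" get weights of opposite sign.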
I don't know whether this would be helpful to anybody or whether it has
already been discussed. That is why I am asking if it is worth submitting
as a pull request.
Gist URL :
https://gist.github.com/psorianom/0b9d8a742fe0efe0fe82
Of course it is open to opinions.
Thanks, have a good weekend!
Pavel
P.S: The results using different schemes (Pang's movie reviews balanced
dataset) with stock LinearSVC:
1. BM25tf-idf
Accuracy: 0.852 (+/- 0.052)
2. Tf-idf Vanilla
Accuracy: 0.864 (+/- 0.081)
3. Tf-Delta idf
Accuracy: 0.975 (+/- 0.046)
4. Tf-BM25idf
Accuracy: 0.744 (+/- 0.104)
5. Tf-Delta BM25idf
Accuracy: 0.949 (+/- 0.070)
6. BM25tf-Delta BM25idf
Accuracy: 0.943 (+/- 0.065)
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general