2014-08-23 15:44 GMT+02:00 Pavel Soriano <sorianopa...@gmail.com>:
> I don't know if this would be helpful to anybody or if this was already
> discussed. That is why I am asking if it is worthy to be pull requested.
> Gist URL :
> https://gist.github.com/psorianom/0b9d8a742fe0efe0fe82

Yes! BM25 is high on my wishlist. I had been wondering why the text
classification community doesn't use it as a baseline, given that the
IR community decided decades ago that it models term importance better
than tf-idf.

However, I think it should be implemented as a separate
BM25Transformer, not as yet another overload of the Christmas tree that
is TfidfVectorizer, so that we don't get even more invalid combinations
of parameter settings. The feature_extraction.text code is already a
pain to maintain. A separate transformer also matters because BM25 has
two parameters that need tuning (usually called k1 and b), which I
don't immediately see in your code.
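
To make the idea concrete, here is a rough sketch of what a standalone
BM25Transformer could look like. It is not code from the gist or from
scikit-learn: the class name comes from the suggestion above, while the
defaults k1=1.2 and b=0.75, the smoothed idf variant, and the use of
dense NumPy arrays are all assumptions made for illustration. A real
version would presumably subclass BaseEstimator and TransformerMixin
and work on sparse matrices.

```python
import numpy as np


class BM25Transformer:
    """Hypothetical sketch: re-weight a term-count matrix with Okapi BM25."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1  # controls how quickly term frequency saturates
        self.b = b    # strength of document-length normalization (0..1)

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        n_docs = X.shape[0]
        # Document frequency: number of documents containing each term.
        df = np.count_nonzero(X, axis=0)
        # Assumed idf variant: smoothed so that it never goes negative.
        self.idf_ = np.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # Average document length, measured in term occurrences.
        self.avgdl_ = X.sum(axis=1).mean()
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        doc_len = X.sum(axis=1, keepdims=True)
        # Length-dependent part of the denominator, one value per document.
        norm = self.k1 * (1.0 - self.b + self.b * doc_len / self.avgdl_)
        # Saturating tf component times idf; zero counts stay zero.
        return self.idf_ * X * (self.k1 + 1.0) / (X + norm)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
```

With k1 and b as constructor parameters, both can be tuned by grid
search like any other hyperparameter, and no TfidfVectorizer option
needs to change. The sketch assumes non-empty documents; with b=1 an
empty document would cause a division by zero.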

As for delta idf, I'd never heard of it, but I'm reading up on it now. If we
decide to implement it, then I'd rather do so in terms of a
SupervisedTf transformer that can also do more classical weighting
schemes like tf-chi² (term frequency × chi² test statistic) or tf-ig
(tf × information gain) [1, 2, 3].

[1] http://www.nmis.isti.cnr.it/debole/articoli/SAC03b.pdf
[2] https://www-old.comp.nus.edu.sg/~tancl/publications/j2009/PAMI2007-v3.pdf
[3] This paper from a guy at HP Research that I cannot find right now.
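
For illustration, the tf-chi² scheme mentioned above could be sketched
roughly as follows. Everything here is an assumption rather than an
agreed design: the SupervisedTf name is only the one floated above, the
chi2_scores helper is hypothetical, the statistic is computed on term
presence per class in a one-vs-rest fashion, and taking the maximum
over classes is just one possible way to get a single weight per term.

```python
import numpy as np


def chi2_scores(X, y):
    """Hypothetical helper: chi² of term presence vs. class, max over classes."""
    present = np.asarray(X) > 0          # (n_docs, n_terms) presence flags
    y = np.asarray(y)
    n_docs = present.shape[0]
    scores = np.zeros(present.shape[1])
    for label in np.unique(y):
        in_class = y == label
        # 2x2 contingency table per term: presence x class membership.
        a = present[in_class].sum(axis=0).astype(float)    # present, in class
        b = present[~in_class].sum(axis=0).astype(float)   # present, other classes
        c = in_class.sum() - a                             # absent, in class
        d = (~in_class).sum() - b                          # absent, other classes
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # Terms present in all or no documents get a score of zero.
        chi2 = np.divide(n_docs * (a * d - b * c) ** 2, denom,
                         out=np.zeros_like(denom), where=denom > 0)
        scores = np.maximum(scores, chi2)
    return scores


class SupervisedTf:
    """Hypothetical sketch: tf times a per-term supervised score."""

    def __init__(self, score_func=chi2_scores):
        self.score_func = score_func  # any callable (X, y) -> per-term scores

    def fit(self, X, y):
        self.weights_ = self.score_func(X, y)
        return self

    def transform(self, X):
        # Scale each term-frequency column by its learned weight.
        return np.asarray(X, dtype=float) * self.weights_
```

Swapping in information gain or delta idf would then only require a
different score_func, which is the main appeal of a single transformer
for all of these schemes.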

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
