2014-08-23 15:44 GMT+02:00 Pavel Soriano <sorianopa...@gmail.com>:
> I don't know if this would be helpful to anybody or if this was already
> discussed. That is why I am asking if it is worthy to be pull requested.
> Gist URL :
> https://gist.github.com/psorianom/0b9d8a742fe0efe0fe82

Yes! BM25 is high on my wishlist. I was already wondering why the text
classification community wasn't using it as a baseline, since the IR
community decided decades ago that it's a better model of term importance
than tf-idf.

However, I think the implementation should be a separate BM25Transformer,
not a further overload of the Christmas tree that is TfidfVectorizer, so
that we don't end up with even more invalid combinations of parameter
settings. The feature_extraction.text code is already a pain to maintain.
This is also important because of the two parameters that need to be tuned
in BM25 (which I don't immediately see in your code).

As for delta idf, I'd never heard of it, but I'm reading up on it now. If we
decide to implement it, then I'd rather do so in terms of a SupervisedTf
transformer that can also do more classical weighting schemes like tf-chi²
(term frequency × chi² test statistic) or tf-ig (tf × information gain)
[1, 2, 3].

[1] http://www.nmis.isti.cnr.it/debole/articoli/SAC03b.pdf
[2] https://www-old.comp.nus.edu.sg/~tancl/publications/j2009/PAMI2007-v3.pdf
[3] This paper from a guy at HP Research that I cannot find right now.

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
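
To make the "separate BM25Transformer" idea above a bit more concrete, here is a rough, dependency-free sketch of what such a transformer could look like. Everything in it is an assumption for illustration: the class is hypothetical and not part of scikit-learn, the two tunable parameters are taken to be the usual BM25 k1 (term-frequency saturation) and b (length normalization), the defaults 1.2 and 0.75 are just commonly used values, and the idf is the non-negative log(1 + ...) variant rather than the classic one. A real version would presumably work on scipy.sparse count matrices instead of lists of lists.

```python
import math


class BM25Transformer:
    """Hypothetical sketch of a BM25 weighting step (not a scikit-learn class)."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1  # assumed: controls how quickly term frequency saturates
        self.b = b    # assumed: strength of document-length normalization

    def fit(self, X):
        # X is a dense documents-by-terms table of raw counts (list of lists).
        n_docs = len(X)
        n_terms = len(X[0])
        # Document frequency: number of documents containing each term.
        df = [sum(1 for row in X if row[j] > 0) for j in range(n_terms)]
        # Non-negative idf variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf_ = [math.log(1.0 + (n_docs - d + 0.5) / (d + 0.5)) for d in df]
        # Average document length; fall back to 1.0 for an all-empty corpus.
        self.avgdl_ = (sum(sum(row) for row in X) / n_docs) or 1.0
        return self

    def transform(self, X):
        out = []
        for row in X:
            # Length-dependent part of the denominator, shared by the whole row.
            norm = self.k1 * (1.0 - self.b + self.b * sum(row) / self.avgdl_)
            out.append([
                idf * tf * (self.k1 + 1.0) / (tf + norm) if tf else 0.0
                for tf, idf in zip(row, self.idf_)
            ])
        return out


# Toy count matrix: 3 documents, 3 terms.
counts = [[3, 0, 1], [1, 1, 0], [0, 2, 2]]
weights = BM25Transformer(k1=1.2, b=0.75).fit(counts).transform(counts)
```

Keeping k1 and b as constructor arguments of a standalone class means they could be tuned like any other hyperparameter, without adding more flags to TfidfVectorizer.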