I agree with Vlad that delta-IDF is interesting, but it is not widely
adopted by the community, and I'm not sure it is worth including ... yet.
As Lars points out (and as you suggest), there are other ways to supervise
feature weighting. I agree this has to be a separate transformer
(SupervisedTfidfTransformer or SupervisedTermWeightingTransformer is fine),
and certainly not part of TfidfVectorizer, in part because this weighting
scheme is specific to binary problems and may best be applied as a
transformer within OvR, while the vectorizer should stay outside OvR. So
I'm +0 on delta-IDF.
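
For concreteness, a supervised term-weighting transformer along those lines
could look something like the sketch below. The class name and the choice of
chi² as the weighting statistic are only illustrations taken from this
thread, not an existing scikit-learn API:

```python
# Hypothetical sketch of a SupervisedTermWeightingTransformer (tf-chi2):
# the name is a suggestion from this thread, not a real scikit-learn class.
import numpy as np
import scipy.sparse as sp
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import chi2


class SupervisedTermWeightingTransformer(BaseEstimator, TransformerMixin):
    """Reweight term counts by a per-feature chi2 statistic (tf-chi2)."""

    def fit(self, X, y):
        # chi2 returns one test statistic per feature; use it as a weight.
        scores, _ = chi2(X, y)
        self.weights_ = np.nan_to_num(scores)
        return self

    def transform(self, X):
        # Scale each column (term) by its supervised weight.
        return sp.csr_matrix(X).multiply(self.weights_).tocsr()
```

Because it needs y in fit, it would plug naturally into a per-class pipeline
inside OvR, as suggested above.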

I am +1 on including BM25. At first I thought it might be nice to include
it within TfidfTransformer, alongside the binary, sublinear and linear tf
options currently supported there, but Lars's post made me reconsider. I
think a separate BM25 transformer is the way to go.

It would be nice if we can ensure that it is possible (and documented) to
mix and match tf and idf schemes from different transformers.
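
To make the BM25 transformer idea concrete, here is a rough sketch of what
it could look like. The class name, the Okapi-style idf variant, and the
k1/b defaults are my assumptions, not an agreed design:

```python
# Hedged sketch of the BM25Transformer discussed in this thread; the name,
# the idf variant and the parameter defaults are assumptions, not settled.
import numpy as np
import scipy.sparse as sp
from sklearn.base import BaseEstimator, TransformerMixin


class BM25Transformer(BaseEstimator, TransformerMixin):
    """Apply Okapi BM25 weighting to a matrix of raw term counts.

    k1 controls term-frequency saturation and b controls document-length
    normalization; both are the tunable parameters Lars mentions.
    """

    def __init__(self, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b

    def fit(self, X, y=None):
        X = sp.csr_matrix(X)
        n_samples, n_features = X.shape
        # Document frequency: number of documents containing each term.
        df = np.bincount(X.indices, minlength=n_features)
        # Okapi-style idf, floored at 0 to avoid negative weights.
        self.idf_ = np.maximum(
            np.log((n_samples - df + 0.5) / (df + 0.5)), 0.0)
        self.avgdl_ = X.sum(axis=1).mean()  # average document length
        return self

    def transform(self, X):
        X = sp.csr_matrix(X).astype(np.float64)
        dl = np.asarray(X.sum(axis=1)).ravel()  # per-document lengths
        # Row index of each stored nonzero entry in the CSR matrix.
        rows = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))
        tf = X.data
        denom = tf + self.k1 * (1 - self.b + self.b * dl[rows] / self.avgdl_)
        X.data = self.idf_[X.indices] * tf * (self.k1 + 1) / denom
        return X
```

A separate transformer like this keeps the BM25-specific parameters out of
TfidfVectorizer and lets it be composed with CountVectorizer in a pipeline.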

On 24 August 2014 01:06, Lars Buitinck <larsm...@gmail.com> wrote:

> 2014-08-23 15:44 GMT+02:00 Pavel Soriano <sorianopa...@gmail.com>:
> > I don't know if this would be helpful to anybody or if this was already
> > discussed, which is why I am asking whether it is worth submitting as a
> > pull request.
> > Gist URL:
> > https://gist.github.com/psorianom/0b9d8a742fe0efe0fe82
>
> Yes! BM25 is high on my wishlist. I was already wondering why the text
> classification community wasn't using it as a baseline, since the IR
> community settled decades ago that it's a better model of term
> importance than tf-idf.
>
> However, I think the implementation should be in terms of a separate
> BM25Transformer, not a further overload of the Christmas tree that is
> TfidfVectorizer, to prevent getting even more invalid combinations of
> parameter settings. The feature_extraction.text code is already a pain
> to maintain. This is also important because of the two parameters that
> need to be tuned in BM25 (which I don't immediately see in your code).
>
> As for delta idf, I'd never heard of it, but I'm reading it now. If we
> decide to implement it, then I'd rather do so in terms of a
> SupervisedTf transformer that can also do more classical weighting
> schemes like tf-chi² (term frequency × chi² test statistic) or tf-ig
> (tf × information gain) [1, 2, 3].
>
> [1] http://www.nmis.isti.cnr.it/debole/articoli/SAC03b.pdf
> [2]
> https://www-old.comp.nus.edu.sg/~tancl/publications/j2009/PAMI2007-v3.pdf
> [3] This paper from a guy at HP Research that I cannot find right now.
>
>
> ------------------------------------------------------------------------------
> Slashdot TV.
> Video for Nerds.  Stuff that matters.
> http://tv.slashdot.org/
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>