Re: [scikit-learn] TF-IDF

Roman Yurchak Mon, 02 Oct 2017 00:31:09 -0700

Hi Apurva,

if you consider the operations done by the augmented frequency and the cosine normalization independently from everything else, they are somewhat similar. Normalization by the max is normalization by a p-norm with p→+∞, while cosine normalization uses p=2. So apart from the 0.5 offset, both can be seen as document length normalization with a different p value.
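
As a rough illustration of the comparison above, the following sketch applies both normalizations to a toy vector of term counts. It assumes the commonly cited form of augmented frequency, 0.5 + 0.5 * tf / max(tf), applied only to terms that occur in the document; the function names and the example counts are made up for illustration.

```python
import math

def max_normalize(counts):
    """Divide each count by the largest count, i.e. by the p -> +inf norm."""
    peak = max(counts)
    return [c / peak for c in counts]

def augmented_frequency(counts):
    """Assumed form: 0.5 + 0.5 * tf / max(tf) for terms present, 0 otherwise."""
    peak = max(counts)
    return [0.5 + 0.5 * c / peak if c > 0 else 0.0 for c in counts]

def l2_normalize(counts):
    """Cosine normalization: divide by the p=2 (Euclidean) norm."""
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts]

def p_norm(values, p):
    """Generic p-norm, used here to show that a large p approaches the max."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

counts = [4, 2, 1, 0]                  # hypothetical term counts of a short document
long_doc = [10 * c for c in counts]    # same proportions, ten times longer

print(max_normalize(counts), max_normalize(long_doc))
print(l2_normalize(counts), l2_normalize(long_doc))
print(augmented_frequency(counts))
print(p_norm(counts, 2), p_norm(counts, 50), max(counts))
```

Both normalizations map the short and the long document to the same vector, which is the length-normalization effect discussed above; they differ only in which norm they divide by.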

However, in TF-IDF you would typically have an IDF document weighting operation between the term frequency weighting and the normalization, in which case the effect of both will be quite different. Generally I find that the SMART IR notation is very useful to represent the different phases of the TF-IDF transformation.
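
To make the three phases concrete, here is a small sketch in plain Python that follows the SMART letters: "n" or "a" for natural or augmented term frequency, "t" for IDF, and "n" or "c" for no normalization or cosine normalization. The toy counts, the helper names and the plain log(N / df) form of the IDF are assumptions for illustration, not scikit-learn's exact formula.

```python
import math

# Hypothetical term counts: 3 documents (rows) x 3 terms (columns).
docs = [
    [3, 1, 0],
    [1, 0, 0],
    [2, 4, 1],
]

def idf_weights(docs):
    """SMART 't': log(N / df) per term (assumes every term occurs somewhere)."""
    n_docs = len(docs)
    n_terms = len(docs[0])
    df = [sum(1 for d in docs if d[j] > 0) for j in range(n_terms)]
    return [math.log(n_docs / df_j) for df_j in df]

def weight(doc, idf, tf_scheme, norm_scheme):
    """Apply tf weighting, then IDF, then an optional cosine normalization."""
    if tf_scheme == "a":    # augmented frequency
        peak = max(doc)
        tf = [0.5 + 0.5 * c / peak if c > 0 else 0.0 for c in doc]
    else:                   # "n": natural term frequency
        tf = [float(c) for c in doc]
    weighted = [t * w for t, w in zip(tf, idf)]
    if norm_scheme == "c":  # cosine (l2) normalization
        length = math.sqrt(sum(w * w for w in weighted))
        if length > 0:
            weighted = [w / length for w in weighted]
    return weighted

idf = idf_weights(docs)
ntc = weight(docs[2], idf, "n", "c")   # natural tf, idf, cosine
atn = weight(docs[2], idf, "a", "n")   # augmented tf, idf, no normalization
print(ntc)
print(atn)
```

For the last document, the second term outweighs the third under "ntc", while the order is reversed under "atn", because the augmented frequency compresses the counts before the IDF is applied.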

The default parameters of TfidfTransformer are a good choice that will work well in most cases. Also, depending on the algorithm that you use afterwards, not having your data normalized by an actual norm (e.g. cosine) may be sub-optimal. Still, if you want to fine-tune your document normalization, have a look at the "Pivoted Document Length Normalization" paper by Singhal et al. There is a compatible implementation of this and a few other TF-IDF schemes in http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html
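
For reference, a minimal sketch of what the defaults do to a matrix of counts, compared with switching the final normalization off. The count matrix and the helper row_l2_norms are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term-count matrix: 3 documents (rows) x 3 terms (columns).
counts = np.array([[3, 1, 0],
                   [1, 0, 0],
                   [2, 4, 1]])

# Defaults: norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=False.
X_default = TfidfTransformer().fit_transform(counts)

# Same weighting with the final normalization step switched off.
X_raw = TfidfTransformer(norm=None).fit_transform(counts)

def row_l2_norms(X):
    """Euclidean length of each row of a sparse matrix."""
    return np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()

print(row_l2_norms(X_default))   # rows scaled to unit length
print(row_l2_norms(X_raw))       # row lengths still depend on the document
```

With the default norm, every document ends up with unit Euclidean length, which is what estimators relying on dot products or distances usually expect.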

In the end, it's probably easier to try different options on your dataset to see what works and what doesn't. You could just determine it by cross-validating.
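
One possible way to set this up is a grid search over the TfidfTransformer options inside a pipeline. This is only a sketch: the tiny corpus, the labels and the choice of LogisticRegression are placeholders, and you would substitute your own data and estimator.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Placeholder corpus and labels, only here to make the sketch runnable.
texts = [
    "the cat sat on the mat", "a cat and a kitten", "cats chase mice",
    "my cat sleeps all day", "stocks fell on monday", "the market rallied",
    "investors sold their stocks", "the bank raised interest rates",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])

# Options of TfidfTransformer to compare by cross-validation.
param_grid = {
    "tfidf__norm": ["l1", "l2", None],
    "tfidf__use_idf": [True, False],
    "tfidf__sublinear_tf": [True, False],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

On a real dataset you would use more folds and a scoring metric suited to the task; the result on a corpus this small is not meaningful.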


--
Roman

On 27/09/17 13:53, Apurva Nandan wrote:

Hello,

Could anybody tell me the difference between using augmented frequency
(which is used for weighting term frequencies to eliminate the bias
towards larger documents) and cosine normalization (l2 norm which
scikit-learn uses for TfidfTransformer)?
Augmented frequency is given by the following equation. It tries to
divide the natural term frequency by the maximum frequency of any term
in the document.

[Inline image 1: augmented frequency equation]

Do they both do the same thing when it comes to eliminating bias towards
larger documents? I suppose scikit-learn uses the natural term freq, and
cosine normalization is enabled by using norm='l2'.

Any help would be appreciated!

- Apurva


_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn

