Hi Apurva,

If you consider the operations done by the augmented frequency and the cosine normalization independently from everything else, they are somewhat similar. Normalization by the max is a p-norm with p→+∞. So apart from the 0.5 offset, both can be seen as document length normalization with a different p value.
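
To make the comparison concrete, here is a rough sketch in plain Python. The term counts are made up, and I assume the augmented frequency is applied only to terms that actually occur in the document, which is one common convention rather than the only one.

```python
def p_norm(values, p):
    """Return the p-norm of a sequence of numbers."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

# Hypothetical raw term counts for a single document.
counts = [3, 1, 0, 2]

# Max normalization: the limit of the p-norm as p grows.
max_count = max(counts)
large_p = p_norm(counts, 50)  # already very close to max_count

# Augmented frequency: 0.5 offset plus max-normalized counts
# (assumption: absent terms stay at zero).
augmented = [0.5 + 0.5 * c / max_count if c > 0 else 0.0 for c in counts]

# Cosine normalization: divide by the 2-norm instead.
l2 = p_norm(counts, 2)
cosine = [c / l2 for c in counts]

print("p=50 norm:", large_p, "max:", max_count)
print("augmented:", augmented)
print("cosine:   ", cosine)
```

Both schemes divide the counts by a norm of the count vector; only the choice of p and the 0.5 offset differ.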

However, in TF-IDF you would typically have an IDF document weighting operation between the term frequency weighting and the normalization, in which case the effect of the two will be quite different. Generally, I find the SMART IR notation very useful for representing the different phases of the TF-IDF transformation.

The default parameters of TfidfTransformer are a good choice that will work well in most cases. Also, depending on the algorithm that you use afterwards, not having your data normalized by an actual norm (e.g. cosine) may be sub-optimal. Still, if you want to fine-tune your document normalization, have a look at the "Pivoted Document Length Normalization" paper by Singhal et al. There is a compatible implementation of this and a few other TF-IDF schemes in http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html
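
For reference, a minimal sketch of the default setup in scikit-learn. The three documents are invented for illustration; with the defaults (norm='l2', use_idf=True), every non-empty row should end up with unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat around the garden all day",
    "dogs and cats",
]

# Raw (natural) term counts.
counts = CountVectorizer().fit_transform(docs)

# Default parameters: IDF weighting followed by l2 (cosine) normalization.
tfidf = TfidfTransformer().fit_transform(counts)

# Euclidean length of each document vector after the transformation.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
print(row_norms)
```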

In the end, it's probably easier to try different options on your dataset to see what works and what doesn't. You could just determine it by cross-validating.
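
Something along these lines could work as a starting point. The documents, labels, classifier and parameter grid are all placeholders that I made up; swap in your own data and whichever estimator you actually use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy dataset: short texts with binary labels.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "win money now click here",
    "click here to win cheap pills",
    "meeting agenda for monday",
    "please review the attached report",
    "monday meeting moved to tuesday",
    "report review scheduled for tuesday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Compare normalization choices (None disables normalization)
# and natural vs. sublinear term frequency.
param_grid = {
    "tfidf__norm": ["l1", "l2", None],
    "tfidf__sublinear_tf": [False, True],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)

print(search.best_params_, search.best_score_)
```

On a dataset this small the scores mean nothing, of course; the point is only the shape of the search.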

--
Roman

On 27/09/17 13:53, Apurva Nandan wrote:
Hello,

Could anybody tell me the difference between using augmented frequency
(which is used for weighting term frequencies to eliminate the bias
towards larger documents) and cosine normalization (l2 norm which
scikit-learn uses for TfidfTransformer)?
Augmented frequency is given by the following equation. It tries to
divide the natural term frequency by the maximum frequency of any term
in the document.

Inline image 1

Do they both do the same thing when it comes to eliminating bias towards
larger documents? I suppose scikit-learn uses the natural term frequency, and
cosine normalization is enabled by setting norm='l2'.

Any help would be appreciated!

- Apurva


_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn

