Hi Apurva,
if you consider the operations done by the augmented frequency and the
cosine normalization independently from everything else, they are
somewhat similar. The normalization by max is a p-norm with p → +∞. So
apart from the 0.5 offset, both can be seen as document length
normalization with a different p value.
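To make the p-norm view concrete, here is a minimal sketch in plain Python. The counts are made-up values, the helper p_norm is a hypothetical name, and leaving absent terms at 0 in the augmented weighting is an assumption of this sketch rather than something fixed by the formula.

```python
import math

def p_norm(values, p):
    """p-norm of a vector; p = math.inf gives the largest absolute value."""
    if math.isinf(p):
        return max(abs(v) for v in values)
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

# Hypothetical raw term counts for a single document (illustrative only).
counts = [3.0, 1.0, 0.0, 2.0]

# Division by the max count is division by the infinity-norm.
max_scaled = [c / p_norm(counts, math.inf) for c in counts]

# Augmented frequency adds the 0.5 offset on top of the max scaling.
# Assumption: terms that do not occur in the document keep a weight of 0.
augmented = [0.5 + 0.5 * m if c > 0 else 0.0 for c, m in zip(counts, max_scaled)]

# Cosine normalization is division by the 2-norm.
l2_scaled = [c / p_norm(counts, 2) for c in counts]

# As p grows, the p-norm approaches the max, which links the two schemes.
for p in (1, 2, 10, 50):
    print(p, p_norm(counts, p))
```

The loop at the end only illustrates that the norm shrinks toward max(counts) as p increases; it is not part of either weighting scheme.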
However, in TF-IDF you would typically have an IDF document
weighting operation between the term frequency weighting and the
normalization, in which case the effect of both will be quite different.
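As a rough illustration of where the IDF step sits, the sketch below applies the two schemes to a made-up count matrix. The helper names are hypothetical, the augmented variant again leaves absent terms at 0 by assumption, and the IDF used is the smoothed form ln((1 + n) / (1 + df)) + 1 documented by scikit-learn for smooth_idf=True.

```python
import math

# Hypothetical document-term counts (3 documents, 3 terms); illustrative only.
corpus_counts = [
    [4, 1, 0],
    [2, 0, 1],
    [6, 3, 3],
]
n_docs = len(corpus_counts)
n_terms = len(corpus_counts[0])

# Document frequency of each term, then the smoothed IDF weight.
df = [sum(1 for row in corpus_counts if row[j] > 0) for j in range(n_terms)]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def augmented_tf(row):
    """0.5 + 0.5 * tf / max(tf) for terms that occur, 0 otherwise (assumption)."""
    top = max(row)
    return [0.5 + 0.5 * c / top if c > 0 else 0.0 for c in row]

def l2_normalize(row):
    """Divide a row by its Euclidean norm; all-zero rows are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)

# Scheme A: length normalization happens before IDF (augmented tf, then IDF).
scheme_a = [[t * w for t, w in zip(augmented_tf(row), idf)]
            for row in corpus_counts]

# Scheme B: length normalization happens after IDF (natural tf, IDF, then l2).
scheme_b = [l2_normalize([c * w for c, w in zip(row, idf)])
            for row in corpus_counts]

for a, b in zip(scheme_a, scheme_b):
    print(a, b)
```

In scheme B every row ends up with unit length after the IDF weights are applied, while in scheme A the IDF weights rescale values that were already normalized, so the rows no longer share a common norm.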
Generally I find that the SMART IR notation is very useful to represent
different phases of the TF-IDF transformation.
The default parameters of TfidfTransformer are a good choice that will
work well in most cases. Also, depending on the algorithm that you use
afterwards, not having your data normalized by an actual norm (e.g.
cosine) may be sub-optimal. Still, if you want to fine-tune your
document normalization, have a look at the "Pivoted Document Length
Normalization" paper by Singhal et al. There is a compatible
implementation of this and a few other TF-IDF schemes in
http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html
In the end, it's probably easier to try different options on your
dataset to see what works and what doesn't. You could just determine it
by cross-validating.
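A minimal sketch of such a cross-validated comparison with scikit-learn is shown below. The toy corpus, the labels, the choice of LogisticRegression and the parameter grid are all assumptions made for illustration; on real data you would plug in your own documents and estimator.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus and labels, for illustration only.
docs = [
    "cheap pills buy now",
    "buy cheap watches now now",
    "cheap offer buy today",
    "limited offer buy cheap pills",
    "meeting agenda for monday",
    "project meeting notes attached",
    "agenda and notes for the project",
    "monday project review meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Compare a few weighting options exposed by TfidfTransformer.
param_grid = {
    "tfidf__norm": ["l1", "l2", None],
    "tfidf__sublinear_tf": [False, True],
}

# cv=2 only because the toy corpus is tiny; use more folds on real data.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```

TfidfTransformer itself does not offer augmented frequency, so this grid only covers its built-in options; other schemes would need a different transformer in the "tfidf" step.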
--
Roman
On 27/09/17 13:53, Apurva Nandan wrote:
Hello,
Could anybody tell me the difference between using augmented frequency
(which is used for weighting term frequencies to eliminate the bias
towards larger documents) and cosine normalization (l2 norm which
scikit-learn uses for TfidfTransformer)?
Augmented frequency is given by the following equation. It tries to
divide the natural term frequency by the maximum frequency of any term
in the document.
[Inline image 1: augmented frequency equation]
Do they both do the same thing when it comes to eliminating bias towards
larger documents? I suppose scikit-learn uses the natural term freq, and
cosine normalization is enabled by using norm='l2'.
Any help would be appreciated!
- Apurva
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn