Hi,

I am trying to understand the exact formula for tf-idf.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 1), norm=None)
wordtfidf = vectorizer.fit_transform(texts)

Given the following three documents, where id1, id2, and id3 are their
IDs:

id1     AA BB BB CC CC CC
id2     AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD
id3     AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD FF FF FF FF FF FF FF FF FF
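
For reference, here is a self-contained reconstruction of what I ran
(the printing loop at the end is only there to display the nonzero
weights):

```python
# Reconstruction of the snippet above on the three documents.
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "AA BB BB CC CC CC",                                                        # id1
    "AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD",                             # id2
    "AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD FF FF FF FF FF FF FF FF FF",  # id3
]

vectorizer = TfidfVectorizer(ngram_range=(1, 1), norm=None)
wordtfidf = vectorizer.fit_transform(texts)

# Print the nonzero tf-idf weight of each term in each document.
dense = wordtfidf.toarray()
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for doc_id, row in zip(["id1", "id2", "id3"], dense):
    for term, value in zip(terms, row):
        if value:
            print(doc_id, term, value)
```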

The results are the following.

id1    cc    5.079441541679836
id1    bb    2.5753641449035616
id1    aa    1.0
id2    dd    7.726092434710685
id2    bb    6.438410362258904
id2    aa    4.0
id3    ff    15.238324625039509
id3    dd    10.301456579614246
id3    aa    7.0

According to "6.2.3.4. Tf–idf term weighting" on the following page.

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

For aa, since n = 3 and df = 3, idf(aa) = log((1+n)/(1+df)) + 1 = log(1) + 1 = 1.
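
A quick hand check, assuming (as the results seem to suggest) that tf
is the raw in-document count and idf is the smoothed formula above with
natural log and n = 3, does reproduce the id1 values:

```python
# Hand check of the id1 row, assuming tf = raw count and
# idf(t) = ln((1 + n) / (1 + df(t))) + 1 with n = 3 documents.
import math

def idf(df, n=3):
    return math.log((1 + n) / (1 + df)) + 1

# Document frequencies: aa appears in 3 docs, bb in 2, cc in 1.
print(3 * idf(1))  # cc in id1: raw count 3
print(2 * idf(2))  # bb in id1: raw count 2
print(1 * idf(3))  # aa in id1: raw count 1
```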

But I don't understand why tf-idf(id1, aa) is 1. That would mean
tf(id1, aa) is 1, i.e. just the raw count of aa. Shouldn't the count
be divided by the number of terms in document id1, giving 1/6 instead
of 1?

Thanks.

-- 
Regards,
Peng
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
