Hi, I am trying to understand the exact formula for tf-idf.
The code:

    vectorizer = TfidfVectorizer(ngram_range=(1, 1), norm=None)
    wordtfidf = vectorizer.fit_transform(texts)

Given the following 3 documents (id1, id2, id3 are the IDs of the three documents):

    id1  AA BB BB CC CC CC
    id2  AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD
    id3  AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD FF FF FF FF FF FF FF FF FF

The results are the following:

    id1  cc  5.079441541679836
    id1  bb  2.5753641449035616
    id1  aa  1.0
    id2  dd  7.726092434710685
    id2  bb  6.438410362258904
    id2  aa  4.0
    id3  ff  15.238324625039509
    id3  dd  10.301456579614246
    id3  aa  7.0

According to "6.2.3.4. Tf–idf term weighting" on the following page:

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

For aa, since n = 3 and df = 3, idf(aa) = log((1+n)/(1+df)) + 1 = 1.

But I don't understand why tf-idf(id1, aa) is 1. That means tf(id1, aa) is 1, which is just the raw count of aa in id1. Shouldn't it be divided by the number of terms in doc id1, which would give 1/6 instead of 1?

Thanks.

--
Regards,
Peng
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
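For reference, the nine reported values can be reproduced by hand as raw term count times the smoothed idf from the linked page, idf(t) = ln((1+n)/(1+df(t))) + 1. This is a minimal sketch using only the standard library (no scikit-learn call), under the assumption that with norm=None the tf factor is the raw count:

```python
import math

# The three documents from the post (for reference; only the counts
# derived from them are used below).
texts = [
    "AA BB BB CC CC CC",
    "AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD",
    "AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD FF FF FF FF FF FF FF FF FF",
]
n = len(texts)  # n = 3 documents

def idf(df, n=n):
    # Smoothed idf as described in section 6.2.3.4 of the scikit-learn docs.
    return math.log((1 + n) / (1 + df)) + 1

# Document frequencies: df(aa) = 3, df(bb) = 2, df(cc) = 1, df(dd) = 2, df(ff) = 1
print(1 * idf(3))  # tf-idf(id1, aa) -> 1.0
print(2 * idf(2))  # tf-idf(id1, bb) -> 2.5753641449035616
print(3 * idf(1))  # tf-idf(id1, cc) -> 5.079441541679836
print(6 * idf(2))  # tf-idf(id2, dd) -> 7.726092434710685
print(9 * idf(1))  # tf-idf(id3, ff) -> 15.238324625039509
```

These products match the TfidfVectorizer output above exactly, without any division by the document length.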