Hi there, unfortunately I currently don't have time to walk through your example, but I wrote down how the Tf-idf in sklearn works using some examples here: https://github.com/rasbt/pattern_classification/blob/90710922e4f4d7e3f432221b8a4d2ec1dd2d9dc9/machine_learning/scikit-learn/tfidf_scikit-learn.ipynb
(I remember that we used it to write portions of the documentation in sklearn later)

Best,
Sebastian

> On Feb 1, 2020, at 12:53 PM, Peng Yu <pengyu...@gmail.com> wrote:
>
> Hi,
>
> I am trying to understand the exact formula for tf-idf.
>
> vectorizer = TfidfVectorizer(ngram_range = (1, 1), norm = None)
> wordtfidf = vectorizer.fit_transform(texts)
>
> Given the following 3 documents (id1, id2, id3 are the IDs of the
> three documents).
>
> id1 AA BB BB CC CC CC
> id2 AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD
> id3 AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD FF FF FF FF FF FF FF FF FF
>
> The results are the following.
>
> id1  cc  5.079441541679836
> id1  bb  2.5753641449035616
> id1  aa  1.0
> id2  dd  7.726092434710685
> id2  bb  6.438410362258904
> id2  aa  4.0
> id3  ff  15.238324625039509
> id3  dd  10.301456579614246
> id3  aa  7.0
>
> According to "6.2.3.4. Tf–idf term weighting" on the following page.
>
> https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
>
> For aa, as n = 3 and df = 3, idf(aa) = log((1+n)/(1+df)) + 1 = 1.
>
> But I don't understand why tf-idf(id1, aa) is 1. This means that
> tf(id1, aa) is 1, which is just the count of aa, shouldn't it be
> divided by the number of terms in the doc id1, which should result in
> 1/6 instead of 1?
>
> Thanks.
>
> --
> Regards,
> Peng
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
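
For anyone who wants to check the numbers in the quoted question by hand, here is a minimal sketch in plain Python. It does not call scikit-learn. It assumes that tf is the raw term count (no division by document length), that idf follows the smoothed formula quoted above, log((1+n)/(1+df)) + 1 with the natural logarithm, and that no normalization is applied, as with norm = None. The names `docs`, `counts`, `df`, `idf` and `tfidf` are illustrative only and are not part of any library.

```python
import math
from collections import Counter

# The three documents from the question.
docs = {
    "id1": "AA BB BB CC CC CC",
    "id2": "AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD",
    "id3": "AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD "
           "FF FF FF FF FF FF FF FF FF",
}

# Assumption: tf is the raw count of each (lowercased) term in a document.
counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
df = Counter(term for term_counts in counts.values() for term in term_counts)


def idf(term):
    """Smoothed idf as quoted in the question: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


# Unnormalized tf-idf: raw count times idf.
tfidf = {
    doc_id: {term: tf * idf(term) for term, tf in term_counts.items()}
    for doc_id, term_counts in counts.items()
}

for doc_id, weights in tfidf.items():
    for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
        print(doc_id, term, weight)
```

Under these assumptions the sketch reproduces the values listed in the question, for example 1 * 1.0 = 1.0 for (id1, aa) and 3 * (ln(2) + 1) ≈ 5.0794 for (id1, cc). That is consistent with tf being the plain count rather than the count divided by the document length. Any per-document rescaling would come from the norm option, which is switched off in the question.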