Re: [scikit-learn] The exact formula used to compute the tf-idf

Sebastian Raschka Sat, 01 Feb 2020 11:30:13 -0800

Hi there,

Unfortunately, I don't currently have time to walk through your example, but I 
wrote down how tf-idf in sklearn works, with some examples, here: 
https://github.com/rasbt/pattern_classification/blob/90710922e4f4d7e3f432221b8a4d2ec1dd2d9dc9/machine_learning/scikit-learn/tfidf_scikit-learn.ipynb


(As I remember, we later used it to write portions of the sklearn 
documentation.)
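
For the three documents in the question, the posted values are consistent with
tf being the raw term count (not divided by document length) and
idf(t) = ln((1+n)/(1+df(t))) + 1, which is what the documentation describes for
smooth_idf=True. With norm=None no length normalization is applied afterwards,
so tf-idf(id1, aa) = 1 * 1 = 1. Below is a minimal sketch in plain Python that
recomputes the numbers under those assumptions. It is not the library's code:
it splits on whitespace and lowercases, which happens to match the default
tokenizer for these documents, and the variable names are my own.

```python
import math
from collections import Counter

# The three documents from the question.
docs = {
    "id1": "AA BB BB CC CC CC",
    "id2": "AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD",
    "id3": "AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD "
           "FF FF FF FF FF FF FF FF FF",
}

# tf: raw term counts per document (assumption: no division by doc length).
counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

# df: number of documents that contain each term.
n_docs = len(counts)
df = Counter(term for tf in counts.values() for term in tf)

# Smoothed idf as in the documentation: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

# Unnormalized tf-idf (corresponds to norm=None).
tfidf = {
    doc_id: {term: tf[term] * idf[term] for term in tf}
    for doc_id, tf in counts.items()
}

for doc_id, weights in tfidf.items():
    for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
        print(doc_id, term, weight)
```

With the default norm='l2', each document's row would additionally be scaled to
unit Euclidean length, which is where a dependence on document length comes in.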

Best,
Sebastian

> On Feb 1, 2020, at 12:53 PM, Peng Yu <pengyu...@gmail.com> wrote:
> 
> Hi,
> 
> I am trying to understand the exact formula for tf-idf.
> 
> from sklearn.feature_extraction.text import TfidfVectorizer
> 
> vectorizer = TfidfVectorizer(ngram_range=(1, 1), norm=None)
> wordtfidf = vectorizer.fit_transform(texts)
> 
> Given the following 3 documents (id1, id2, id3 are the IDs of the
> three documents).
> 
> id1   AA BB BB CC CC CC
> id2   AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD
> id3   AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD FF FF FF FF FF FF FF FF FF
> 
> The results are the following.
> 
> id1   cc   5.079441541679836
> id1   bb   2.5753641449035616
> id1   aa   1.0
> id2   dd   7.726092434710685
> id2   bb   6.438410362258904
> id2   aa   4.0
> id3   ff   15.238324625039509
> id3   dd   10.301456579614246
> id3   aa   7.0
> 
> According to "6.2.3.4. Tf–idf term weighting" on the following page.
> 
> https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
> 
> For aa, as n = 3 and df = 3, idf(aa) = log((1+n)/(1+df)) + 1 = 1.
> 
> But I don't understand why tf-idf(id1, aa) is 1. This means that
> tf(id1, aa) is 1, which is just the raw count of aa. Shouldn't the
> count be divided by the number of terms in doc id1, which would give
> 1/6 instead of 1?
> 
> Thanks.
> 
> -- 
> Regards,
> Peng
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

