Hi Sole,

It’s been a long time, but I remember helping draft the Tf-idf text in the 
documentation during a scikit-learn sprint at SciPy a looong time ago, where I 
mentioned this difference (it initially surprised me because I couldn’t get it 
to match my from-scratch implementation). As far as I remember, the sklearn 
version addresses some instability issues for certain edge cases.

I am not sure if that helps, but I have briefly compared the textbook and the 
sklearn tf-idf implementations here: 
https://github.com/rasbt/machine-learning-book/blob/main/ch08/ch08.ipynb
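For what it’s worth, here is a minimal sketch of the core difference as I 
understand it from the scikit-learn docs (numbers below are just a toy 
example, not from the notebook): with the default smooth_idf=True, sklearn 
adds 1 inside the log (to both numerator and denominator) and adds 1 to the 
result, so terms that appear in every document are not zeroed out; it also 
L2-normalizes each document vector afterwards.

```python
import math

# Toy statistics: n documents total, df = documents containing the term.
n = 4   # total number of documents
df = 2  # number of documents containing the term

# "Textbook" idf, as in many IR references:
idf_textbook = math.log(n / df)

# scikit-learn's default (smooth_idf=True): acts as if one extra document
# containing every term were present, then adds 1 so idf never drops to 0.
idf_sklearn = math.log((1 + n) / (1 + df)) + 1

print(idf_textbook)  # ln(2)
print(idf_sklearn)   # ln(5/3) + 1
```

So even before the L2 normalization, the raw idf values already differ for 
the same corpus, which is why a from-scratch textbook implementation won’t 
match sklearn’s output term by term.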

Best,
Sebastian





--
Sebastian Raschka, PhD
Machine learning and AI researcher, https://sebastianraschka.com

Staff Research Engineer at Lightning AI, https://lightning.ai


On May 28, 2024 at 9:43 AM -0500, Sole Galli via scikit-learn 
<scikit-learn@python.org>, wrote:
> Hi guys,
>
> I'd like to understand why sklearn's implementation of tf-idf is different 
> from the standard textbook notation as described in the docs: 
> https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
>
> Do you have any reference that I could take a look at? I didn't manage to 
> find them in the docs, maybe I missed something?
>
> Thank you!
>
> Best wishes
> Sole
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
