Hi Sebastian,

Thank you so much for sending the link. By the looks of it, the 
modification was introduced so that words that appear in all documents get 
an idf of 0 (or 1 once the +1 is added to the result of the log). With the 
textbook formula, they'd receive a negative value instead.
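
In case it helps anyone reading the thread later, here is a minimal sketch of 
that comparison in plain Python. It assumes the formulas as written in the 
docs section linked below: textbook idf = log(n / (1 + df)), and sklearn 
idf = ln((1 + n) / (1 + df)) + 1 with smooth_idf=True, or ln(n / df) + 1 
without smoothing. The function names and the toy corpus size are my own 
and are not part of sklearn's API.

```python
import math


def idf_textbook(n_docs, df):
    # Textbook form as quoted in the scikit-learn docs: log(n / (1 + df)).
    return math.log(n_docs / (1 + df))


def idf_sklearn(n_docs, df, smooth_idf=True):
    # Formulas as documented for TfidfTransformer (assumed from the docs,
    # not copied from the sklearn source code).
    if smooth_idf:
        return math.log((1 + n_docs) / (1 + df)) + 1
    return math.log(n_docs / df) + 1


n_docs = 4  # hypothetical corpus size
for df in (1, 2, 4):  # df == n_docs means the term is in every document
    print(
        f"df={df}: textbook={idf_textbook(n_docs, df):.3f}, "
        f"sklearn={idf_sklearn(n_docs, df):.3f}"
    )
```

For df equal to n_docs, the textbook value drops below zero, while both 
sklearn variants give exactly 1 (log part 0, plus 1).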

Thank you!
Best
Sole

On Tuesday, May 28th, 2024 at 4:52 PM, Sebastian Raschka 
<m...@sebastianraschka.com> wrote:

> Hi Sole,
>
> It’s been a long time, but I remember helping with drafting the Tf-idf text 
> in the documentation as part of a scikit-learn sprint at SciPy a looong time 
> ago where I mentioned this difference (since it initially surprised me, 
> because I couldn’t get it to match my from-scratch implementation). As far as 
> I remember, the sklearn version addressed some instability issues for certain 
> edge cases.
>
> I am not sure if that helps, but I have briefly compared the textbook vs the 
> sklearn tf-idf here: 
> https://github.com/rasbt/machine-learning-book/blob/main/ch08/ch08.ipynb
>
> Best,
> Sebastian
>
> --
> Sebastian Raschka, PhD
> Machine learning and AI researcher, https://sebastianraschka.com
>
> Staff Research Engineer at Lightning AI, https://lightning.ai
>
> On May 28, 2024 at 9:43 AM -0500, Sole Galli via scikit-learn 
> <scikit-learn@python.org>, wrote:
>
>> Hi guys,
>>
>> I'd like to understand why sklearn's implementation of tf-idf is different 
>> from the standard textbook notation as described in the docs: 
>> https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
>>
>> Do you have any references that I could take a look at? I didn't manage to 
>> find any in the docs; maybe I missed something?
>>
>> Thank you!
>>
>> Best wishes
>> Sole
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn