Hi Joel,

Thanks for your response and for digging up that archived thread; it gives me a lot of clarity.
I see your point about BM25, but I think that in most cases where TFIDF makes sense, BM25 makes sense as well; it could just be overkill. Consider that TFIDF does not produce normalized results either (see http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py). If BM25 requires dimensionality reduction (e.g. using LSA), then so would TFIDF: the term-document matrix is the same size no matter which weighting scheme is used. The only difference is that BM25 produces better results when the corpus is large enough that term frequency within a document, and document frequency across the corpus, can vary considerably over a broad range of values. You could even say TFIDF and BM25 are the same equation, except that BM25 has a few additional hyperparameters (b and k1).

So, is the advantage BM25 provides on large, diverse corpora worth it, or is it marginal? Perhaps you can point me to some more examples where TFIDF is used (preferably in a supervised setting), and I can plug in BM25 in place of TFIDF and see how it compares. Here are the ones I found:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html (supervised)
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py (unsupervised)

Thank you!
Basil

PS: I'm not familiar with the delta-idf transform that Pavel mentions in the archive you linked; I'll have to delve deeper into that. I agree with the response to Pavel that it should go in a separate class rather than being bolted onto the TFIDF one. I think it would take me about 6-8 weeks to adapt my code to the fit/transform model and submit a pull request.
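PPS: In case it helps make the comparison concrete, here is a rough sketch of the Okapi BM25 weighting applied to a plain term-count matrix. This is my own illustrative NumPy sketch, not scikit-learn API and not the code I'd submit; the function name `bm25_weights` and the default hyperparameters (k1=1.5, b=0.75) are just assumptions for the example:

```python
import numpy as np

def bm25_weights(tf, k1=1.5, b=0.75):
    """Apply Okapi BM25 weighting to a dense (docs x terms) count matrix.

    Illustrative sketch only: uses the smoothed idf variant
    idf(t) = log(1 + (N - df + 0.5) / (df + 0.5)).
    """
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[0]
    df = (tf > 0).sum(axis=0)                            # document frequency per term
    idf = np.log(1 + (n_docs - df + 0.5) / (df + 0.5))   # smoothed idf
    dl = tf.sum(axis=1)                                  # document lengths
    avgdl = dl.mean()                                    # average document length
    # BM25 term-frequency saturation, with length normalization via b
    denom = tf + k1 * (1 - b + b * dl[:, None] / avgdl)
    return idf * tf * (k1 + 1) / denom

# Toy example: 2 documents, 3 terms
counts = np.array([[3, 0, 1],
                   [0, 2, 2]])
weights = bm25_weights(counts)
```

Setting b=0 disables length normalization, and letting k1 grow large makes the saturation term approach raw tf, which is one way to see the "same equation plus hyperparameters" relationship to TFIDF.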
_______________________________________________ scikit-learn mailing list [email protected] https://mail.python.org/mailman/listinfo/scikit-learn
