Hi Basil,

Scikit-learn isn't a library for information retrieval. The question is: how useful is the BM25 feature reweighting in a machine learning context?
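To make the question concrete, here is a minimal sketch of what a BM25 reweighting step could look like in a fit/transform shape. It is pure Python rather than scikit-learn code. The class name BM25Weighter, the defaults k1=1.2 and b=0.75, and the non-negative IDF variant are assumptions for illustration only; they are not taken from the earlier thread or from your implementation.

```python
import math
from collections import Counter


class BM25Weighter:
    """Hypothetical fit/transform-style BM25 reweighter (not a scikit-learn class)."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1  # controls how quickly repeated terms saturate
        self.b = b    # strength of document-length normalisation (0 = none)

    def fit(self, docs):
        # docs: a non-empty list of token lists, at least one of them non-empty.
        n_docs = len(docs)
        df = Counter()
        for tokens in docs:
            df.update(set(tokens))  # document frequency: count each term once per doc
        self.avgdl_ = sum(len(tokens) for tokens in docs) / n_docs
        # Assumption: the "+1 inside the log" IDF variant, which is never negative.
        self.idf_ = {
            term: math.log(1.0 + (n_docs - n + 0.5) / (n + 0.5))
            for term, n in df.items()
        }
        return self

    def transform(self, docs):
        # Returns one {term: weight} dict per document; unseen terms are dropped.
        rows = []
        for tokens in docs:
            tf = Counter(tokens)
            norm = self.k1 * (1.0 - self.b + self.b * len(tokens) / self.avgdl_)
            rows.append({
                term: self.idf_[term] * count * (self.k1 + 1.0) / (count + norm)
                for term, count in tf.items()
                if term in self.idf_
            })
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats make good pets".split(),
]
weights = BM25Weighter().fit_transform(corpus)
```

The per-document dicts play the role of the rows that a TF-IDF transform would produce, so the open question is whether a classifier or clustering algorithm trained on these weights does measurably better than one trained on plain TF-IDF.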
This has been discussed before at
https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg11353.html.
The whole thread is worth reading. Despite the enthusiasm, it never got as far as a pull request. The major burden is still to show that this transformation helps for classification/clustering.

Joel

On 14 June 2016 at 12:44, Basil Beirouti <basilbeiro...@gmail.com> wrote:

> Hello all,
>
> You can use sklearn.feature_extraction.text.TfidfVectorizer to learn a
> corpus of documents and rank them in order of relevance to a new previously
> unseen query.
>
> BM25 works in a similar manner to TfidfVectorizer, but is more complex and
> considered one of the most successful information retrieval algorithms.
>
> I currently have code that implements BM25 quite efficiently to learn a
> corpus of documents and I want to modify/port it to align with the
> fit-transform framework of sklearn. I think it could fit neatly into the
> current codebase.
>
> Questions:
> 1.) Would this be a desirable feature?
> 2.) Any advice for how to proceed with this? Things to watch out for?
>
> Any and all advice is welcome.
>
> Thanks!
> Basil
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn