[scikit-learn] Added BM25Transformer and BM25Vectorizer to sklearn.feature_extraction.text

Basil Beirouti Sun, 10 Jul 2016 14:47:23 -0700

Hi all,

I have submitted a pull request against the main branch. It adds
BM25Transformer and BM25Vectorizer, which are very similar to
TfidfTransformer and TfidfVectorizer, except that they implement the BM25
weighting scheme instead. I would really appreciate feedback on the quality
of my work and how I can improve it.
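
For anyone unfamiliar with BM25, here is a minimal pure-Python sketch of the
weighting. It is not the code from the pull request. The function name
bm25_weights, the defaults k1=1.2 and b=0.75, and the smoothed
(non-negative) IDF variant are all assumptions chosen for illustration.

```python
import math


def bm25_weights(counts, k1=1.2, b=0.75):
    """Re-weight a dense term-count matrix with Okapi BM25.

    counts[i][j] is the raw count of term j in document i.
    bm25_weights, k1 and b are illustrative names and defaults,
    not taken from the pull request.
    """
    n_docs = len(counts)
    n_terms = len(counts[0]) if n_docs else 0

    # Document lengths and their average drive the length normalization.
    doc_len = [sum(row) for row in counts]
    avg_len = sum(doc_len) / n_docs if n_docs else 0.0

    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

    # Assumed IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)), which
    # stays non-negative even for terms found in most documents.
    idf = [math.log(1.0 + (n_docs - d + 0.5) / (d + 0.5)) for d in df]

    weighted = []
    for row, length in zip(counts, doc_len):
        # Longer-than-average documents get a larger denominator term.
        norm = k1 * (1.0 - b + b * length / avg_len) if avg_len else k1
        weighted.append([
            idf[j] * tf * (k1 + 1.0) / (tf + norm) if tf else 0.0
            for j, tf in enumerate(row)
        ])
    return weighted


counts = [[2, 0, 1],
          [0, 1, 1],
          [1, 1, 0]]
weights = bm25_weights(counts)
```

Unlike plain TF-IDF, the term-frequency part saturates as the count grows,
and b controls how strongly long documents are penalized (b=0 turns the
length normalization off).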


Sincerely,
Basil Beirouti

On Sat, Jul 9, 2016 at 11:00 AM, <[email protected]> wrote:

> Today's Topics:
>
>    1. Re: Create a "Feature_Weight" Parameter at
>       RandomForestRegressor (Andreas Mueller)
>    2. Re: Scikit learn GridSearchCV fit method ValueError Found
>       array with 0 sample (Maciek Wójcikowski)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 8 Jul 2016 17:00:42 -0400
> From: Andreas Mueller <[email protected]>
> To: Scikit-learn user and developer mailing list
>         <[email protected]>
> Subject: Re: [scikit-learn] Create a "Feature_Weight" Parameter at
>         RandomForestRegressor
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
> You would need to implement a custom splitter, I think.
>
> On 07/04/2016 04:09 PM, [email protected] wrote:
> > I would like to give different weights to the features in the feature set
> > for the split task of Random Forest. Right now, only the MSE metric is
> > used to select the best split, and I want to do something like
> > feature[i] = MSE[i] * feature_weight[i]. This way, I'll be able to give
> > more importance to the features I already know are better.
> >
> > In my mind, this change would be exposed through the fit function,
> > something like this: def fit(self, X, y, sample_weight, feature_weight):
> > The feature_weight would be a vector with customized weights for all
> > features present in the dataset.
> >
> > What is the best way to do that? I'm having a really hard time figuring
> > out how to make these changes in the code.
> > Thanks a lot for your attention.
> >
> > Luiz Felipe
> >
> > _______________________________________________
> > scikit-learn mailing list
> > [email protected]
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> ------------------------------
>
> Message: 2
> Date: Fri, 8 Jul 2016 23:42:06 +0200
> From: Maciek Wójcikowski <[email protected]>
> To: Scikit-learn user and developer mailing list
>         <[email protected]>
> Subject: Re: [scikit-learn] Scikit learn GridSearchCV fit method
>         ValueError Found array with 0 sample
> Message-ID:
>         <
> cah2jjr35cfdjpqtnfn7+uscvkuvjepm9mjydwltgkiplewc...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Michał,
>
> What are the class counts in that set? Maybe there is a problem with
> generating stratified subsamples (e.g. some classes get below 1 sample)?
>
> ----
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> [email protected]
>
> 2016-07-08 17:22 GMT+02:00 Michał Nowotka <[email protected]>:
>
> > Hi,
> >
> > Sorry for cross-posting
> > (http://stackoverflow.com/questions/38263933/scikit-learn-gridsearchcv-fit-method-valueerror-found-array-with-0-sample)
> > but I don't know where it is better to get help with my problem.
> > I'm working on a VM with a Jupyter notebook server installed.
> > From time to time I add new notebooks and re-evaluate old ones to see
> > if they still work.
> >
> > This notebook stopped working because of some changes in the scikit-learn
> > API and some parameters that became obsolete:
> >
> > https://github.com/chembl/mychembl/blob/master/ipython_notebooks/10_myChEMBL_machine_learning.ipynb
> >
> > I've created a corrected version of the notebook here:
> >
> > https://gist.github.com/anonymous/676c55cc501ffa48fecfcc1e1252d433
> >
> > But I'm stuck in cell 36 on this code:
> >
> > from sklearn import cross_validation
> > from sklearn.cross_validation import KFold, StratifiedKFold
> > from sklearn.grid_search import GridSearchCV
> >
> > X_traina, X_testa, y_traina, y_testa = cross_validation.train_test_split(
> >     x, y, test_size=0.95, random_state=23)
> >
> > params = {'min_samples_split': [8], 'max_depth': [20],
> >           'min_samples_leaf': [1], 'n_estimators': [200]}
> > cv = KFold(n=len(X_traina), n_folds=10, shuffle=True)
> > cv_stratified = StratifiedKFold(y_traina, n_folds=5)
> > gs = GridSearchCV(custom_forest, params,
> >                   cv=cv_stratified, verbose=1, refit=True)
> > gs.fit(X_traina, y_traina)
> >
> > This gives me:
> >
> > ValueError: Found array with 0 sample(s) (shape=(0, 491)) while a
> > minimum of 1 is required.
> >
> > Now I don't understand this because when I print shapes of the samples:
> >
> > print (X_traina.shape, X_testa.shape, y_traina.shape, y_testa.shape)
> >
> > I'm getting:
> >
> > ((78, 491), (1489, 491), (78,), (1489,))
> >
> > Interestingly, if I change the test_size parameter to 0.88 (like in
> > the example corrected notebook) it works and this is the highest value
> > where it works. For this value, the shapes are:
> >
> > ((188, 491), (1379, 491), (188,), (1379,))
> >
> > So the question is - what should I change in my code to make it work
> > for test_size set to 0.95 as well?
> >
> > Kind regards,
> >
> > Michal Nowotka
> >
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 4, Issue 13
> *******************************************
>

