Hi, it's in the making: https://github.com/scikit-learn/scikit-learn/pull/14696
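In the meantime, note that the linked PR concerns the experimental HistGradientBoosting estimators; the classic GradientBoostingClassifier / GradientBoostingRegressor already accept a sample_weight argument in fit. A minimal sketch (the weighting scheme below is just an illustrative choice, not a recommendation):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Illustrative weights: give class 2 five times the influence of the others.
weights = np.where(y == 2, 5.0, 1.0)

clf = GradientBoostingClassifier(random_state=0)
# sample_weight is passed to fit(), not to the constructor
clf.fit(X, y, sample_weight=weights)
print(clf.score(X, y))
```

The same fit(X, y, sample_weight=...) signature works for the regressor variant.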
On Fri, Oct 25, 2019 at 4:23 AM WONG Wing Mei <wong.wing...@uobgroup.com> wrote:

> Can I ask whether we can use sample weights in gradient boosting? And how
> to do it?
>
> -----Original Message-----
> From: scikit-learn [mailto:scikit-learn-bounces+wong.wingmei=uobgroup....@python.org] On Behalf Of scikit-learn-requ...@python.org
> Sent: Friday, October 25, 2019 12:00 AM
> To: scikit-learn@python.org
> Subject: scikit-learn Digest, Vol 43, Issue 38
>
> Send scikit-learn mailing list submissions to
>         scikit-learn@python.org
>
> To subscribe or unsubscribe, visit
>         https://mail.python.org/mailman/listinfo/scikit-learn
>
> Today's Topics:
>
>    1. Re: Decision tree results sometimes different with scaled
>       data (Alexandre Gramfort)
>    2.
>       Reminder: Monday October 28th meeting (Adrin)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 24 Oct 2019 14:09:01 +0200
> From: Alexandre Gramfort <alexandre.gramf...@inria.fr>
> To: Scikit-learn mailing list <scikit-learn@python.org>
> Subject: Re: [scikit-learn] Decision tree results sometimes different
>         with scaled data
>
> another reason is that we take as threshold the midpoint between sample
> values, which is not invariant to arbitrary scaling of the features
>
> Alex
>
> On Tue, Oct 22, 2019 at 11:56 AM Guillaume Lemaître <g.lemaitr...@gmail.com> wrote:
>
> > Even with the same random state, it can happen that several features
> > lead to an equally good best split, and the split is then chosen
> > randomly among them (even with the seed fixed; I think this has been
> > reported as an issue). The rest of the tree can then differ, leading to
> > different predictions.
> >
> > Another possibility is that we compute the difference between the
> > current threshold and the next one to be tried, and only check the
> > entropy if it is larger than a specific value (I would need to check the
> > source code). After scaling, it could happen that two feature values
> > become too close to be considered as a potential split, which would make
> > a difference between scaled and unscaled features. But this difference
> > should be really small.
> >
> > This is what I can think of off the top of my head.
> >
> > Sent from my phone - sorry for being brief and for potential misspellings.
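A small illustration of the midpoint point above (a plain NumPy sketch, not scikit-learn's splitter code): in exact arithmetic, midpoints between consecutive feature values commute with a positive rescaling, but under float64 rounding the two orders of operations can round differently, so some candidate thresholds move.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.random(1000))  # a sorted feature column
s = np.pi                      # an arbitrary positive scale factor

mid = (x[:-1] + x[1:]) / 2                 # candidate thresholds on raw data
mid_scaled = (x[:-1] * s + x[1:] * s) / 2  # candidate thresholds on scaled data

# In exact arithmetic mid * s == mid_scaled. With float64 rounding the two
# computations can disagree in the last bit, and a threshold that lands on
# the other side of a sample value changes the split.
n_diff = np.count_nonzero(mid * s != mid_scaled)
print(f'{n_diff} of {mid.size} midpoints changed after rescaling')
```

With ~1000 midpoints, a handful of them typically differ after rescaling, which is consistent with only 1 or 2 predictions changing per run.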
> > From: geoffrey.bolm...@gmail.com
> > Sent: 22 October 2019 11:34
> > To: scikit-learn@python.org
> > Subject: [scikit-learn] Decision tree results sometimes different with
> > scaled data
> >
> > Hi all,
> >
> > First, let me thank you for the great job you guys are doing developing
> > and maintaining such a popular library!
> >
> > As we all know, decision trees are not supposed to be impacted by scaled
> > data, because splits don't take into account distances between two
> > values within a feature.
> >
> > However, I experienced a strange behavior using the sklearn decision
> > tree algorithm. Sometimes the results of the model differ depending on
> > whether the input data has been scaled or not.
> >
> > To illustrate my point, I ran experiments on the iris dataset consisting of:
> >
> >    - perform a train/test split
> >    - fit the training set and predict the test set
> >    - fit and predict again with standardized inputs (removing the mean
> >      and scaling to unit variance)
> >    - compare both models' predictions
> >
> > Experiments were run 10,000 times with different random seeds (cf.
> > traceback and code to reproduce it at the end).
> > Results showed that a bit more than 10% of the time, at least one
> > prediction differs. Fortunately, when that is the case, only a few
> > predictions differ, 1 or 2 most of the time. I checked the inputs
> > causing different predictions, and they are not the same from one run to
> > another.
> >
> > I'm worried that the rate of different predictions could be larger for
> > other datasets...
> > Do you have an idea where it comes from? Maybe it is due to
> > floating-point errors, or am I doing something wrong?
> >
> > Cheers,
> > Geoffrey
> >
> > ------------------------------------------------------------
> > Traceback:
> > ------------------------------------------------------------
> > Error rate: 12.22%
> >
> > Seed: 241862
> > All pred equal: False
> > Unscaled data confusion matrix:
> > [[16  0  0]
> >  [ 0 17  0]
> >  [ 0  4 13]]
> > Scaled data confusion matrix:
> > [[16  0  0]
> >  [ 0 15  2]
> >  [ 0  4 13]]
> > ------------------------------------------------------------
> > Code:
> > ------------------------------------------------------------
> > import numpy as np
> >
> > from sklearn.datasets import load_iris
> > from sklearn.metrics import confusion_matrix
> > from sklearn.model_selection import train_test_split
> > from sklearn.preprocessing import StandardScaler
> > from sklearn.tree import DecisionTreeClassifier
> >
> > X, y = load_iris(return_X_y=True)
> >
> >
> > def run_experiment(X, y, seed):
> >     X_train, X_test, y_train, y_test = train_test_split(
> >         X, y, stratify=y, test_size=0.33, random_state=seed
> >     )
> >
> >     scaler = StandardScaler()
> >     X_train_scaled = scaler.fit_transform(X_train)
> >     X_test_scaled = scaler.transform(X_test)
> >
> >     clf = DecisionTreeClassifier(random_state=seed)
> >     clf_scaled = DecisionTreeClassifier(random_state=seed)
> >
> >     clf.fit(X_train, y_train)
> >     clf_scaled.fit(X_train_scaled, y_train)
> >
> >     pred = clf.predict(X_test)
> >     pred_scaled = clf_scaled.predict(X_test_scaled)
> >
> >     err = 0 if all(pred == pred_scaled) else 1
> >
> >     return err, y_test, pred, pred_scaled
> >
> >
> > n_err, n_run, seed_err = 0, 10000, None
> >
> > for _ in range(n_run):
> >     seed = np.random.randint(10000000)
> >     err, _, _, _ = run_experiment(X, y, seed)
> >     n_err += err
> >
> >     # keep aside the last seed causing an error
> >     seed_err = seed if err == 1 else seed_err
> >
> >
> > print(f'Error rate: {round(n_err / n_run * 100, 2)}%', end='\n\n')
> >
> > _, y_test, pred, pred_scaled = run_experiment(X, y, seed_err)
> >
> > print(f'Seed: {seed_err}')
> > print(f'All pred equal: {all(pred == pred_scaled)}')
> > print(f'Unscaled data confusion matrix:\n{confusion_matrix(y_test, pred)}')
> > print(f'Scaled data confusion matrix:\n{confusion_matrix(y_test, pred_scaled)}')
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> ------------------------------
>
> Message: 2
> Date: Thu, 24 Oct 2019 17:10:26 +0200
> From: Adrin <adrin.jal...@gmail.com>
> To: Scikit-learn mailing list <scikit-learn@python.org>
> Subject: [scikit-learn] Reminder: Monday October 28th meeting
>
> Hi Scikit-learn people,
>
> This is a reminder that we'll be having our monthly call on Monday.
>
> Please put your thoughts and the important topics you have in mind on
> the project board:
> https://github.com/scikit-learn/scikit-learn/projects/15
>
> We'll be meeting on https://appear.in/amueller
>
> As usual, it'd be nice to have them on the board before the weekend :)
>
> See you on Monday,
> Adrin.
> ------------------------------
>
> End of scikit-learn Digest, Vol 43, Issue 38
> ********************************************