If I understand your question correctly, the answer is yes! If you want a clearer response, it would help to spell out what alternative procedure you have in mind.
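For concreteness, here is a minimal sketch of that workflow (the toy documents, the TfidfVectorizer/LinearSVC choice and the C value are all placeholders, not anything from this thread; with a separate CountVectorizer + TfidfTransformer the same pattern applies): once the hyperparameters are tuned by cross-validation on the training data, refit the whole pipeline on the full training set and only call transform/predict (via the pipeline) on the held-out test set.

# Hypothetical sketch: refit on the full training set after tuning,
# then only transform + predict on the held-out test set.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder toy data: raw text documents and their labels.
X_train = ["the cat sat on the mat", "dogs chase cats",
           "stock prices fell today", "the market rallied"]
y_train = ["pets", "pets", "finance", "finance"]
X_test = ["my dog likes the cat", "prices of stocks rose"]
y_test = ["pets", "finance"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(C=1.0)),  # C assumed already tuned by cross-validation
])

pipeline.fit(X_train, y_train)         # fit_transform + fit on the full train set
print(pipeline.score(X_test, y_test))  # transform + predict on the test set
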
On 19 August 2014 03:13, ZORAIDA HIDALGO SANCHEZ <[email protected]> wrote:

> I am using TfidfTransformer on documents that I need to classify. In order
> to evaluate the model, I need to apply the whole pipeline
> (TfidfTransformer, Classifier) to the test dataset. In the training step I
> am using cross-validation, and in each iteration I apply
> tfidf.fit_transform/transform and classifier.fit/predict to the train and
> test folds respectively. My question is: once I have 'tuned' the
> classifier parameters, do I have to 'fit_transform' the whole train
> dataset (for both the Tfidf and the classifier) and then transform and
> predict the test dataset?
>
> Thanks.
>
> On 18/08/14 18:43, "Olivier Grisel" <[email protected]> wrote:
>
> >2014-08-18 18:28 GMT+02:00 <[email protected]>:
> >>
> >> On Mon, Aug 18, 2014 at 12:15 PM, Olivier Grisel
> >> <[email protected]> wrote:
> >>>
> >>> On 18 August 2014 16:16, "Sebastian Raschka" <[email protected]> wrote:
> >>>
> >>> > On Aug 18, 2014, at 3:46 AM, Olivier Grisel
> >>> > <[email protected]> wrote:
> >>> >
> >>> > > But the sklearn.cross_validation.Bootstrap currently implemented in
> >>> > > sklearn is a cross-validation iterator, not a generic resampling
> >>> > > method to estimate variance or confidence intervals. Don't be misled
> >>> > > by the name. If we chose to deprecate and then remove this class,
> >>> > > it's precisely because it causes confusion.
> >>> >
> >>> > Hm, I can kind of see why the Bootstrap class was initially put into
> >>> > sklearn.cross_validation; technically, the "approaches"
> >>> > (cross-validation, bootstrap, jackknife) are very related. The only
> >>> > difference is that you have sampling "with replacement" in the
> >>> > bootstrap approach and that you would typically want to have >1000
> >>> > iterations.
> >>> >
> >>> > So, the suggestion would be to remove Bootstrap and use
> >>> > sklearn.utils.resample in the future?
> >>>
> >>> Well, it depends on what you want to use bootstrapping for. If it's for
> >>> model evaluation (estimation of some validation score), then the
> >>> recommended way is to use ShuffleSplit or StratifiedShuffleSplit
> >>> instead. If you want generic bootstrap estimation features such as
> >>> confidence interval estimation (which does not exist in scikit-learn,
> >>> by the way), then I would recommend you have a look at
> >>> scikits.bootstrap [1], which also implements bias correction for
> >>> skewed distributions, something that is non-trivial to do manually.
> >>>
> >>> [1] https://scikits.appspot.com/bootstrap
> >>>
> >>> sklearn.utils is meant only for internal use in the scikit-learn
> >>> project. For instance, sklearn.utils.resample is useful to implement
> >>> resampling internally in bagging models, if I remember correctly.
> >>>
> >>> > I would say that it is good that the Bootstrap is implemented like a
> >>> > CV object,
> >>>
> >>> I think precisely the opposite. There is no point in using sampling
> >>> with replacement vs sampling without replacement to estimate the
> >>> validation error of a model. Traditional strategies for
> >>> cross-validation as implemented in Shuffle & Split are as flexible and
> >>> simpler to interpret than our weird Bootstrap cross-validation
> >>> iterator.
> >>>
> >>> See also: http://youtu.be/BzHz0J9a6k0?t=9m38s
> >>>
> >>> > since it would make the "estimate" and "error" calculation more
> >>> > convenient, right?
> >>>
> >>> I don't understand what you mean by "estimate" and "error". Both the
> >>> model parameters, its individual predictions and its cross-validation
> >>> scores or errors can be called "estimates": anything that is derived
> >>> from sampled data points is an estimate.
> >>
> >> Just a remark from the sidelines.
> >> (I hope to get bootstrap and cross-validation iterators into the next
> >> version of statsmodels, borrowing some of the ideas and code from
> >> scikit-learn, but the emphasis in statsmodels will be on bootstrap and
> >> permutation iterators.)
> >>
> >> What I think sklearn doesn't have is early stopping with randomized
> >> selection for cross-validation iterators. If LOO/jackknife is expensive
> >> to calculate for all LOO sets, can you randomly select among the LOO
> >> sets, or do something similar for other iterators?
> >
> >No, but that would be a good idea for ShuffleSplit as well. If I
> >understand correctly, you would pass something like a tolerance parameter
> >(e.g. "I want a validation score precise to 2 decimals") and use as few
> >iterations as possible to reach that precision and then stop sampling. Is
> >that right?
> >
> >> Similarly, permutation inference is often difficult because the set of
> >> permutations gets too large; bootstrap is then the usual alternative
> >> for larger samples.
> >>
> >> (I may be incorrect since I only briefly looked at the changes to your
> >> cross-validation.)
> >
> >One thing to keep in mind is that sklearn.cross_validation.Bootstrap is
> >not the real bootstrap: it's a random permutation + split + random
> >sampling with replacement on both sides of the split independently:
> >
> >https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L718
> >
> >This two-step procedure is done to make sure that no test sample is part
> >of the training fold at each iteration. A more natural way to respect
> >that constraint would be to sample with replacement from the full
> >dataset and then use the out-of-bag samples for the validation set. But
> >then you would lose control over the size of the test fold. This second
> >strategy is more like the real bootstrap and is the one I should have
> >implemented initially instead of the weird beast that
> >sklearn.cross_validation.Bootstrap currently is.
> >
> >--
> >Olivier
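To make the ShuffleSplit recommendation quoted above concrete, here is a minimal sketch using the pre-0.18 sklearn.cross_validation module that this thread is about (the iris data and the LogisticRegression estimator are arbitrary placeholders):

# Repeated random train/test splits via ShuffleSplit instead of the
# deprecated Bootstrap iterator.
from sklearn.cross_validation import ShuffleSplit, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
cv = ShuffleSplit(len(iris.target), n_iter=100, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(), iris.data, iris.target, cv=cv)
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), 2 * scores.std()))
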
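The tolerance-based early stopping idea floated in the quoted exchange does not exist in scikit-learn; purely as an illustration of one possible interpretation, a hypothetical helper like the one below could keep drawing ShuffleSplit folds and stop once the standard error of the mean score suggests the requested precision has been reached:

import numpy as np
from sklearn.base import clone
from sklearn.cross_validation import ShuffleSplit

def shuffle_split_score_until(estimator, X, y, tol=0.01, min_iter=10,
                              max_iter=1000, test_size=0.2, random_state=0):
    """Hypothetical helper: average test scores over random splits and stop
    early once 2 * standard error of the mean drops below `tol`.
    X and y are assumed to be numpy arrays."""
    cv = ShuffleSplit(len(y), n_iter=max_iter, test_size=test_size,
                      random_state=random_state)
    scores = []
    for i, (train, test) in enumerate(cv):
        est = clone(estimator).fit(X[train], y[train])
        scores.append(est.score(X[test], y[test]))
        if i + 1 >= min_iter:
            std_err = np.std(scores) / np.sqrt(len(scores))
            if 2 * std_err < tol:  # rough precision criterion
                break
    return np.mean(scores), len(scores)

Whether the stopping rule should be based on the standard error, a bootstrap confidence interval of the scores, or something else entirely is exactly the kind of design question left open in the thread.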
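And for the "real bootstrap" scheme Olivier describes at the end of the quoted message, a bare-bones generator (an illustration, not scikit-learn code) might look like the following: sample the training indices with replacement from the full dataset and validate on whatever is left out-of-bag, accepting that the size of the test fold then varies from iteration to iteration.

import numpy as np

def bootstrap_oob_splits(n_samples, n_iter=100, random_state=0):
    """Hypothetical iterator: bootstrap training indices, out-of-bag test indices."""
    rng = np.random.RandomState(random_state)
    for _ in range(n_iter):
        # Sample n_samples training indices with replacement.
        train = rng.randint(0, n_samples, n_samples)
        # Everything never drawn goes to the test fold (~36.8% on average).
        oob_mask = np.ones(n_samples, dtype=bool)
        oob_mask[train] = False
        yield train, np.where(oob_mask)[0]

# The splits can be used like any cross-validation iterable, e.g. passed as
# cv=list(bootstrap_oob_splits(len(y), 100)) to cross_val_score.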
