If I understand your question correctly, the answer is yes! If you want a clearer response, it would help to spell out what alternative procedure you have in mind.
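For concreteness, here is a minimal sketch of that workflow (the toy documents, the TfidfVectorizer/LinearSVC choice and the C value are all placeholders, not anything from this thread; with a separate CountVectorizer + TfidfTransformer the same pattern applies): once the hyperparameters are tuned by cross-validation on the training data, refit the whole pipeline on the full training set and only call transform/predict (via the pipeline) on the held-out test set.

# Hypothetical sketch: refit on the full training set after tuning,
# then only transform + predict on the held-out test set.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder toy data: raw text documents and their labels.
X_train = ["the cat sat on the mat", "dogs chase cats",
           "stock prices fell today", "the market rallied"]
y_train = ["pets", "pets", "finance", "finance"]
X_test = ["my dog likes the cat", "prices of stocks rose"]
y_test = ["pets", "finance"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(C=1.0)),  # C assumed already tuned by cross-validation
])

pipeline.fit(X_train, y_train)         # fit_transform + fit on the full train set
print(pipeline.score(X_test, y_test))  # transform + predict on the test set
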
On 19 August 2014 03:13, ZORAIDA HIDALGO SANCHEZ <[email protected]> wrote:

> I am using TfidfTransformer on documents that I need to classify. In order
> to evaluate the model, I need to apply the whole pipeline
> (TfidfTransformer, Classifier) to the test dataset. In the training step I
> am using cross-validation, and in each iteration I apply
> tfidf.fit_transform/transform and classifier.fit/predict to the train and
> test folds respectively. My question is: once I have 'tuned' the
> classifier parameters, do I have to 'fit_transform' the whole train
> dataset (for both the Tfidf and the classifier) and then transform and
> predict the test dataset?
>
> Thanks.
>
> On 18/08/14 18:43, "Olivier Grisel" <[email protected]> wrote:
>
> >2014-08-18 18:28 GMT+02:00 <[email protected]>:
> >>
> >> On Mon, Aug 18, 2014 at 12:15 PM, Olivier Grisel
> >> <[email protected]> wrote:
> >>>
> >>> On 18 August 2014 16:16, "Sebastian Raschka" <[email protected]> wrote:
> >>>
> >>> > On Aug 18, 2014, at 3:46 AM, Olivier Grisel
> >>> > <[email protected]> wrote:
> >>> >
> >>> > > But the sklearn.cross_validation.Bootstrap currently implemented in
> >>> > > sklearn is a cross-validation iterator, not a generic resampling
> >>> > > method to estimate variance or confidence intervals. Don't be misled
> >>> > > by the name. If we chose to deprecate and then remove this class,
> >>> > > it's precisely because it causes confusion.
> >>> >
> >>> > Hm, I can kind of see why the Bootstrap class was initially put into
> >>> > sklearn.cross_validation; technically, the "approaches"
> >>> > (cross-validation, bootstrap, jackknife) are very related. The only
> >>> > difference is that you have sampling "with replacement" in the
> >>> > bootstrap approach and that you would typically want to have >1000
> >>> > iterations.
> >>> >
> >>> > So, the suggestion would be to remove Bootstrap and use
> >>> > sklearn.utils.resample in the future?
> >>>
> >>> Well, it depends on what you want to use bootstrapping for. If it's for
> >>> model evaluation (estimation of some validation score), then the
> >>> recommended way is to use ShuffleSplit or StratifiedShuffleSplit
> >>> instead. If you want generic bootstrap estimation features such as
> >>> confidence interval estimation (which does not exist in scikit-learn,
> >>> by the way), then I would recommend you have a look at
> >>> scikits.bootstrap [1], which also implements bias correction for
> >>> skewed distributions, something that is non-trivial to do manually.
> >>>
> >>> [1] https://scikits.appspot.com/bootstrap
> >>>
> >>> sklearn.utils is meant only for internal use in the scikit-learn
> >>> project. For instance, sklearn.utils.resample is useful to implement
> >>> resampling internally in bagging models, if I remember correctly.
> >>>
> >>> > I would say that it is good that the Bootstrap is implemented like a
> >>> > CV object,
> >>>
> >>> I think precisely the opposite. There is no point in using sampling
> >>> with replacement vs sampling without replacement to estimate the
> >>> validation error of a model. Traditional strategies for
> >>> cross-validation as implemented in Shuffle & Split are as flexible and
> >>> simpler to interpret than our weird Bootstrap cross-validation
> >>> iterator.
> >>>
> >>> See also: http://youtu.be/BzHz0J9a6k0?t=9m38s
> >>>
> >>> > since it would make the "estimate" and "error" calculation more
> >>> > convenient, right?
> >>>
> >>> I don't understand what you mean by "estimate" and "error". Both the
> >>> model parameters, its individual predictions and its cross-validation
> >>> scores or errors can be called "estimates": anything that is derived
> >>> from sampled data points is an estimate.
> >>
> >> Just a remark from the sidelines.
> >> (I hope to get bootstrap and cross-validation iterators into the next
> >> version of statsmodels, borrowing some of the ideas and code from
> >> scikit-learn, but the emphasis in statsmodels will be on bootstrap and
> >> permutation iterators.)
> >>
> >> What I think sklearn doesn't have is early stopping with randomized
> >> selection for cross-validation iterators. If LOO/jackknife is expensive
> >> to calculate for all LOO sets, can you randomly select among the LOO
> >> sets, or do something similar for other iterators?
> >
> >No, but that would be a good idea for ShuffleSplit as well. If I
> >understand correctly, you would pass something like a tolerance parameter
> >(e.g. "I want a validation score precise to 2 decimals") and use as few
> >iterations as possible to reach that precision and then stop sampling. Is
> >that right?
> >
> >> Similarly, permutation inference is often difficult because the set of
> >> permutations gets too large; bootstrap is then the usual alternative
> >> for larger samples.
> >>
> >> (I may be incorrect since I only briefly looked at the changes to your
> >> cross-validation.)
> >
> >One thing to keep in mind is that sklearn.cross_validation.Bootstrap is
> >not the real bootstrap: it's a random permutation + split + random
> >sampling with replacement on both sides of the split independently:
> >
> >https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L718
> >
> >This two-step procedure is done to make sure that no test sample is part
> >of the training fold at each iteration. A more natural way to respect
> >that constraint would be to sample with replacement from the full
> >dataset and then use the out-of-bag samples for the validation set. But
> >then you would lose control over the size of the test fold. This second
> >strategy is more like the real bootstrap and is the one I should have
> >implemented initially instead of the weird beast that
> >sklearn.cross_validation.Bootstrap currently is.
> >
> >--
> >Olivier
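To make the ShuffleSplit recommendation quoted above concrete, here is a minimal sketch using the pre-0.18 sklearn.cross_validation module that this thread is about (the iris data and the LogisticRegression estimator are arbitrary placeholders):

# Repeated random train/test splits via ShuffleSplit instead of the
# deprecated Bootstrap iterator.
from sklearn.cross_validation import ShuffleSplit, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
cv = ShuffleSplit(len(iris.target), n_iter=100, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(), iris.data, iris.target, cv=cv)
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), 2 * scores.std()))
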
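The tolerance-based early stopping idea floated in the quoted exchange does not exist in scikit-learn; purely as an illustration of one possible interpretation, a hypothetical helper like the one below could keep drawing ShuffleSplit folds and stop once the standard error of the mean score suggests the requested precision has been reached:

import numpy as np
from sklearn.base import clone
from sklearn.cross_validation import ShuffleSplit

def shuffle_split_score_until(estimator, X, y, tol=0.01, min_iter=10,
                              max_iter=1000, test_size=0.2, random_state=0):
    """Hypothetical helper: average test scores over random splits and stop
    early once 2 * standard error of the mean drops below `tol`.
    X and y are assumed to be numpy arrays."""
    cv = ShuffleSplit(len(y), n_iter=max_iter, test_size=test_size,
                      random_state=random_state)
    scores = []
    for i, (train, test) in enumerate(cv):
        est = clone(estimator).fit(X[train], y[train])
        scores.append(est.score(X[test], y[test]))
        if i + 1 >= min_iter:
            std_err = np.std(scores) / np.sqrt(len(scores))
            if 2 * std_err < tol:  # rough precision criterion
                break
    return np.mean(scores), len(scores)

Whether the stopping rule should be based on the standard error, a bootstrap confidence interval of the scores, or something else entirely is exactly the kind of design question left open in the thread.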
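And for the "real bootstrap" scheme Olivier describes at the end of the quoted message, a bare-bones generator (an illustration, not scikit-learn code) might look like the following: sample the training indices with replacement from the full dataset and validate on whatever is left out-of-bag, accepting that the size of the test fold then varies from iteration to iteration.

import numpy as np

def bootstrap_oob_splits(n_samples, n_iter=100, random_state=0):
    """Hypothetical iterator: bootstrap training indices, out-of-bag test indices."""
    rng = np.random.RandomState(random_state)
    for _ in range(n_iter):
        # Sample n_samples training indices with replacement.
        train = rng.randint(0, n_samples, n_samples)
        # Everything never drawn goes to the test fold (~36.8% on average).
        oob_mask = np.ones(n_samples, dtype=bool)
        oob_mask[train] = False
        yield train, np.where(oob_mask)[0]

# The splits can be used like any cross-validation iterable, e.g. passed as
# cv=list(bootstrap_oob_splits(len(y), 100)) to cross_val_score.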
