Re: [Scikit-learn-general] TdidfTransformer when applied to test dataset

Joel Nothman Sun, 24 Aug 2014 20:56:16 -0700

I think you are welcome to fit on all data that those models will not be
evaluated on!
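
A minimal sketch of the workflow discussed in this thread, assuming a recent scikit-learn (the sklearn.model_selection module and the toy data below are assumptions, not part of the original messages): the vectorizer and classifier are tuned by cross-validation on the training set only, refit on the whole training set, and then only transform/predict is applied to the test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical data, for illustration only).
train_docs = [
    "cheap pills buy now", "buy cheap watches now",
    "limited offer buy now", "cheap offer click now",
    "meeting agenda for monday", "project meeting notes attached",
    "monday project status report", "agenda and notes for the report",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
# "today" never occurs in the training documents.
test_docs = ["buy cheap pills today", "status report for the meeting"]

# Putting the vectorizer inside the pipeline means each CV iteration
# fits TF-IDF on the train fold only and merely transforms the test fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Tune the classifier on the training data; with the default refit=True,
# the best pipeline is refit on the whole training set afterwards.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(train_docs, train_labels)

# The test set is only transformed and predicted, never fitted on.
fitted_tfidf = search.best_estimator_.named_steps["tfidf"]
test_features = fitted_tfidf.transform(test_docs)
predictions = search.predict(test_docs)
```

Terms that occur only in the test documents are simply ignored by transform, so the vocabulary and IDF weights come from the training data alone.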



On 19 August 2014 18:35, ZORAIDA HIDALGO SANCHEZ <
[email protected]> wrote:

>  You are right, Joel. The options would be:
>
>    - Use the last vocabulary built (only the vocabulary of the last train
>    fold). Underfitting?
>    - Use the whole vocabulary (as I proposed in the previous email: train
>    + test folds), i.e. the whole vocabulary of the train dataset.
>    - Apply TfidfVectorizer.fit_transform to the test dataset, i.e. the
>    vocabulary of the test dataset. Overfitting?
>
> Thanks!
>
>   From: Joel Nothman <[email protected]>
> Reply-To: "[email protected]" <
> [email protected]>
> Date: Tuesday, 19 August 2014 00:44
> To: scikit-learn-general <[email protected]>
> Subject: Re: [Scikit-learn-general] TdidfTransformer when applied to test
> dataset
>
>   If I understand your question correctly, the answer is yes!
>
>  If you want a clearer response, you might clarify what the alternative
> hypothesis is to your suggestion.
>
>
> On 19 August 2014 03:13, ZORAIDA HIDALGO SANCHEZ <
> [email protected]> wrote:
>
>> I am using TfidfTransformer on documents that I need to classify. In order
>> to evaluate the model, I need to apply the whole pipeline
>> (TfidfTransformer, classifier) to the test dataset. In the training step,
>> I am using cross-validation, and in each iteration I am applying
>> tfidf.fit_transform/transform and classifier.fit/predict to the train and
>> test folds respectively. My question is: once I have 'tuned' the
>> classifier parameters, do I have to 'fit_transform' the whole train
>> dataset (for both Tfidf and the classifier) and then transform and predict
>> the test dataset?
>>
>> Thanks.
>>
>> On 18/08/14 18:43, "Olivier Grisel" <[email protected]> wrote:
>>
>> >2014-08-18 18:28 GMT+02:00  <[email protected]>:
>> >>
>> >>
>> >>
>> >> On Mon, Aug 18, 2014 at 12:15 PM, Olivier Grisel <[email protected]>
>> >> wrote:
>> >>>
>> >>> On 18 August 2014 16:16, "Sebastian Raschka" <[email protected]>
>> >>> wrote:
>> >>>
>> >>>
>> >>> >
>> >>> >
>> >>> > On Aug 18, 2014, at 3:46 AM, Olivier Grisel <[email protected]>
>> >>> > wrote:
>> >>> >
>> >>> > > But the sklearn.cross_validation.Bootstrap currently implemented in
>> >>> > > sklearn is a cross-validation iterator, not a generic resampling
>> >>> > > method to estimate variance or confidence intervals. Don't be misled
>> >>> > > by the name. If we chose to deprecate and then remove this class,
>> >>> > > it's precisely because it causes confusion.
>> >>> >
>> >>> > Hm, I can kind of see why the Bootstrap class was initially put into
>> >>> > sklearn.cross_validation; technically, the "approaches" (cross
>> >>> > validation, bootstrap, jackknife) are very related. The only
>> >>> > difference is that you have sampling "with replacement" in the
>> >>> > bootstrap approach and that you would typically want to have >1000
>> >>> > iterations.
>> >>>
>> >>> > So, the suggestion would be to remove Bootstrap and use
>> >>> > sklearn.utils.resample in future?
>> >>>
>> >>> Well, it depends on what you want to use bootstrapping for. If it's for
>> >>> model evaluation (estimation of some validation score), then the
>> >>> recommended way is to use ShuffleSplit or StratifiedShuffleSplit
>> >>> instead. If you want generic bootstrap estimation features such as
>> >>> confidence interval estimation (which does not exist in scikit-learn,
>> >>> by the way), then I would recommend that you have a look at
>> >>> scikits.bootstrap [1], which also implements bias correction for
>> >>> skewed distributions, which is non-trivial to do manually.
>> >>>
>> >>> [1] https://scikits.appspot.com/bootstrap
>> >>>
>> >>> sklearn.utils is meant only for internal use in the scikit-learn
>> >>> project. For instance, sklearn.utils.resample is useful to implement
>> >>> resampling internally in bagging models, if I remember correctly.
>> >>>
>> >>> > I would say that it is good that the Bootstrap is implemented like a
>> >>> > CV object,
>> >>>
>> >>> I think precisely the opposite. There is no point in using sampling
>> >>> with replacement vs. sampling without replacement to estimate the
>> >>> validation error of a model. Traditional strategies for
>> >>> cross-validation as implemented in Shuffle & Split are as flexible and
>> >>> simpler to interpret than our weird Bootstrap cross-validation
>> >>> iterator.
>> >>>
>> >>> See also: http://youtu.be/BzHz0J9a6k0?t=9m38s
>> >>>
>> >>> > since it would make the "estimate" and "error" calculation more
>> >>> > convenient, right?
>> >>>
>> >>> I don't understand what you mean by "estimate" and "error". The model
>> >>> parameters, its individual predictions, and its cross-validation scores
>> >>> or errors can all be called "estimates": anything that is derived from
>> >>> sampled data points is an estimate.
>> >>
>> >>
>> >> Just a remark from the sidelines,
>> >> (I hope to get bootstrap and cross-validation iterators into the next
>> >> version of statsmodels, borrowing some of the ideas and code from
>> >> scikit-learn, but emphasis in statsmodels will be on bootstrap and
>> >> permutation iterators.)
>> >>
>> >> What I think sklearn doesn't have is early stopping with randomized
>> >> selection for cross-validation iterators. If LOO/jackknife is expensive
>> >> to calculate for all LOO sets, can you randomly select among the LOO
>> >> sets, or similar for other iterators?
>> >
>> >No, but that would be a good idea for ShuffleSplit as well. If I
>> >understand correctly, you would pass something like a tolerance
>> >parameter (e.g. I want a validation score precise to 2 decimals)
>> >and use as few iterations as possible to reach that precision and then
>> >stop sampling. Is that right?
>> >
>> >> Similarly, permutation inference is often difficult because the set of
>> >> permutations gets too large; then bootstrap is the usual alternative
>> >> for larger samples.
>> >>
>> >> (I may be incorrect since I only briefly looked at the changes to your
>> >> cross-validation.)
>> >
>> >One thing to keep in mind is that sklearn.cross_validation.Bootstrap
>> >is not the real bootstrap: it's a random permutation + split + random
>> >sampling with replacement on both sides of the split independently:
>> >
>> >
>> >https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L718
>> >
>> >This two-step procedure is done to make sure that no test sample is
>> >part of the training fold at each iteration. A more natural way to
>> >respect that constraint would be to sample with replacement from the
>> >full dataset and then use out-of-bag samples for the validation set.
>> >But then you would lose control over the size of the test fold. This
>> >second strategy is more like the real bootstrap and is the one I
>> >should have implemented initially instead of the weird beast that
>> >sklearn.cross_validation.Bootstrap is currently.
>> >
>> >--
>> >Olivier
>> >
>>
>> >------------------------------------------------------------------------------
>> >_______________________________________________
>> >Scikit-learn-general mailing list
>> >[email protected]
>> >https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>> ________________________________
>>
>> The information contained in this transmission is privileged and
>> confidential information intended only for the use of the individual or
>> entity named above. If the reader of this message is not the intended
>> recipient, you are hereby notified that any dissemination, distribution or
>> copying of this communication is strictly prohibited. If you have received
>> this transmission in error, do not read it. Please immediately reply to the
>> sender that you have received this communication in error and then delete
>> it.
>>
>
>
