Re: [Scikit-learn-general] TdidfTransformer when applied to test dataset

ZORAIDA HIDALGO SANCHEZ Tue, 19 Aug 2014 01:37:13 -0700

You're right, Joel. The options would be:

  *   Use the last vocabulary built (only the vocabulary from the last train fold) -> only the vocabulary of the last train fold. Underfitting?
  *   Use the whole vocabulary (as I proposed in my previous email: train + test folds) -> the whole vocabulary of the train dataset.
  *   Apply TfidfVectorizer.fit_transform to the test dataset -> the vocabulary of the test dataset. Overfitting?
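
As a minimal sketch of the second option, assuming toy documents invented for illustration and the current TfidfVectorizer API: the vectorizer is fitted on the training documents only, and the test documents are only transformed, so terms that never appeared in training are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for this illustration (not from the thread).
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["the bird chased the dog"]

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and the IDF weights from the training data.
X_train = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary and those weights; "bird" was never seen
# during fitting, so it simply gets no column.
X_test = vectorizer.transform(test_docs)

vocabulary = sorted(vectorizer.vocabulary_)
```

Calling fit_transform on the test documents instead (the third option) would build a different vocabulary and different IDF weights, so the columns would no longer line up with what the classifier was trained on.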

Thanks!

From: Joel Nothman <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, 19 August 2014 00:44
To: scikit-learn-general <[email protected]>
Subject: Re: [Scikit-learn-general] TdidfTransformer when applied to test dataset

If I understand your question correctly, the answer is yes!

If you want a clearer response, you might clarify what the alternative 
hypothesis is to your suggestion.

On 19 August 2014 03:13, ZORAIDA HIDALGO SANCHEZ <[email protected]> wrote:
I am using TfidfTransformer on documents that I need to classify. In order
to evaluate the model, I need to apply the whole pipeline
(TfidfTransformer, Classifier) to the test dataset. In the training step, I
am using cross-validation, and in each iteration I am applying
tfidf.fit_transform/transform and classifier.fit/predict to the train and
test folds respectively. My question is: once I have 'tuned' the classifier
parameters, do I have to 'fit_transform' the whole train dataset (for both
the Tfidf and the classifier) and then transform and predict the test
dataset?
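
A sketch of the workflow described in this question, under stated assumptions: the documents and labels are invented, LogisticRegression stands in for the unnamed classifier, and the imports use sklearn.model_selection from current scikit-learn releases rather than the modules available in 2014. GridSearchCV refits the whole pipeline on each train fold, and with its default refit=True it then refits the best configuration on the full training set before predict is called on the test documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy data, only for illustration.
train_docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the goalkeeper",
    "fans cheered at the stadium",
    "the new laptop has a fast processor",
    "the phone update fixes a software bug",
    "developers released the new compiler",
    "the server runs a database cluster",
]
train_labels = ["sport"] * 4 + ["tech"] * 4
test_docs = ["the goalkeeper saved the match", "a bug in the compiler"]

# Vectorizer, TF-IDF weighting and classifier are fitted together, so each
# cross-validation iteration learns its vocabulary from its own train fold.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])

search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(train_docs, train_labels)  # tunes C, then refits on all train docs

# The test set is only transformed and predicted, never fitted.
predictions = search.predict(test_docs)
```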

Thanks.

On 18/08/14 18:43, "Olivier Grisel" <[email protected]> wrote:

>2014-08-18 18:28 GMT+02:00 <[email protected]>:
>>
>>
>>
>> On Mon, Aug 18, 2014 at 12:15 PM, Olivier Grisel <[email protected]> wrote:
>>>
>>> On 18 August 2014 16:16, "Sebastian Raschka" <[email protected]> wrote:
>>>
>>>
>>> >
>>> >
>>> > On Aug 18, 2014, at 3:46 AM, Olivier Grisel <[email protected]> wrote:
>>> >
>>> > > But the sklearn.cross_validation.Bootstrap currently implemented in
>>> > > sklearn is a cross-validation iterator, not a generic resampling
>>> > > method to estimate variance or confidence intervals. Don't be misled
>>> > > by the name. If we chose to deprecate and then remove this class,
>>> > > it's precisely because it causes confusion.
>>> >
>>> > Hm, I can kind of see why the Bootstrap class was initially put into
>>> > sklearn.cross_validation: technically, the "approaches"
>>> > (cross-validation, bootstrap, jackknife) are closely related. The only
>>> > difference is that you sample "with replacement" in the bootstrap
>>> > approach and that you would typically want to have >1000 iterations.
>>>
>>> > So, the suggestion would be to remove Bootstrap and use
>>> > sklearn.utils.resample in future?
>>>
>>> Well, it depends on what you want to use bootstrapping for. If it's for
>>> model evaluation (estimation of some validation score), then the
>>> recommended way is to use ShuffleSplit or StratifiedShuffleSplit instead.
>>> If you want generic bootstrap estimation features such as confidence
>>> interval estimation (which does not exist in scikit-learn, by the way),
>>> then I would recommend you have a look at scikits.bootstrap [1], which
>>> also implements bias correction for skewed distributions, something that
>>> is non-trivial to do manually.
>>>
>>> [1] https://scikits.appspot.com/bootstrap
>>>
>>> sklearn.utils is meant only for internal use in the scikit-learn project.
>>> For instance, sklearn.utils.resample is useful to implement resampling
>>> internally in bagging models, if I remember correctly.
>>>
>>> > I would say that it is good that the Bootstrap is implemented like a
>>> > CV object,
>>>
>>> I think precisely the opposite. There is no point in using sampling with
>>> replacement vs sampling without replacement to estimate the validation
>>> error of a model. Traditional strategies for cross-validation as
>>> implemented in Shuffle & Split are as flexible and simpler to interpret
>>> than our weird Bootstrap cross-validation iterator.
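
To make the ShuffleSplit recommendation concrete, here is a minimal sketch under stated assumptions: the dataset is synthetic, LogisticRegression is an arbitrary choice of model, and the imports use sklearn.model_selection, where these classes live in current releases (the thread predates that module).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 20 independent random train/test partitions, sampled without replacement,
# each holding out 25% of the samples for validation.
cv = ShuffleSplit(n_splits=20, test_size=0.25, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
mean_score, std_score = scores.mean(), scores.std()
```

The number of iterations and the test-set size are set independently, which is the flexibility referred to above.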
>>>
>>> See also: http://youtu.be/BzHz0J9a6k0?t=9m38s
>>>
>>> > since it would make the "estimate" and "error" calculation more
>>> > convenient, right?
>>>
>>> I don't understand what you mean by "estimate" and "error". The model
>>> parameters, its individual predictions and its cross-validation scores
>>> or errors can all be called "estimates": anything that is derived from
>>> sampled data points is an estimate.
>>
>>
>> Just a remark from the sidelines,
>> (I hope to get bootstrap and cross-validation iterators into the next
>> version of statsmodels, borrowing some of the ideas and code from
>> scikit-learn, but emphasis in statsmodels will be on bootstrap and
>> permutation iterators.)
>>
>> What I think sklearn doesn't have is early stopping with randomized
>> selection for cross-validation iterators. If LOO/jackknife is expensive
>> to calculate for all LOO sets, can you randomly select among the LOO
>> sets, or do something similar for other iterators?
>
>No, but that would be a good idea for ShuffleSplit as well. If I
>understand correctly, you would pass something like a tolerance
>parameter (e.g. I want a validation score precise to 2 decimals)
>and use as few iterations as possible to reach that precision and then
>stop sampling. Is that right?
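
One possible reading of this tolerance idea, written as a sketch rather than an existing scikit-learn feature: the helper score_until_stable and its parameters are hypothetical, and the stopping rule (standard error of the mean score below tol) is an assumption. Scores from overlapping random splits are not independent, so the standard error is only a heuristic here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit


def score_until_stable(estimator, X, y, tol=0.01, min_splits=5,
                       max_splits=200, random_state=0):
    """Hypothetical helper: draw random splits until the mean score is stable."""
    cv = ShuffleSplit(n_splits=max_splits, test_size=0.25,
                      random_state=random_state)
    scores = []
    # split() is a generator, so unused splits are never computed.
    for train_idx, test_idx in cv.split(X):
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
        if len(scores) >= min_splits:
            std_err = np.std(scores, ddof=1) / np.sqrt(len(scores))
            if std_err < tol:
                break  # assumed stopping rule: precise enough, stop sampling
    return float(np.mean(scores)), len(scores)


# Synthetic data and an arbitrary model, only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
mean_score, n_used = score_until_stable(LogisticRegression(max_iter=1000), X, y)
```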
>
>> Similarly, permutation inference is often difficult because the set of
>> permutations gets too large; bootstrap is then the usual alternative
>> for larger samples.
>>
>> (I may be incorrect since I only briefly looked at the changes to your
>> cross-validation.)
>
>One thing to keep in mind is that sklearn.cross_validation.Bootstrap
>is not the real bootstrap: it's a random permutation + split + random
>sampling with replacement on both sides of the split independently:
>
>https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L718
>
>This two-step procedure is done to make sure that no test sample is
>part of the training fold at each iteration. A more natural way to
>respect that constraint would be to sample with replacement from the
>full dataset and then use the out-of-bag samples for the validation set.
>But then you would lose control over the size of the test fold. This
>second strategy is more like the real bootstrap and is the one I
>should have implemented initially instead of the weird beast that
>sklearn.cross_validation.Bootstrap is currently.
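
The second strategy described above can be sketched as follows. This is an illustration with a hypothetical helper name, bootstrap_oob_splits, not the code of sklearn.cross_validation.Bootstrap: each iteration draws n_samples indices with replacement for training and validates on the indices that were never drawn.

```python
import numpy as np


def bootstrap_oob_splits(n_samples, n_iter=10, random_state=0):
    """Hypothetical helper: yield (train, out-of-bag test) index arrays."""
    rng = np.random.default_rng(random_state)
    for _ in range(n_iter):
        # Sample with replacement from the full dataset.
        train_idx = rng.integers(0, n_samples, size=n_samples)
        # Every index that was never drawn is out-of-bag.
        oob_mask = np.ones(n_samples, dtype=bool)
        oob_mask[train_idx] = False
        yield train_idx, np.flatnonzero(oob_mask)


splits = list(bootstrap_oob_splits(100, n_iter=5))
```

The test fold size is not controlled: on average about 36.8% of the samples (roughly 1/e) end up out-of-bag, and the exact count varies from one iteration to the next.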
>
>--
>Olivier
>
>--------------------------------------------------------------------------
>----
>_______________________________________________
>Scikit-learn-general mailing list
>[email protected]<mailto:[email protected]>
>https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

________________________________

The information contained in this transmission is privileged and confidential 
information intended only for the use of the individual or entity named above. 
If the reader of this message is not the intended recipient, you are hereby 
notified that any dissemination, distribution or copying of this communication 
is strictly prohibited. If you have received this transmission in error, do not 
read it. Please immediately reply to the sender that you have received this 
communication in error and then delete it.

