On 1/21/20 8:23 PM, Charles Pehlivanian wrote:
I understand - I'm kind of conflating the idea of a data sample with that of a
test set. My view assumes there is a sample space of samples, which might
require rethinking the cross-validation setup...
I also think that part of it relies on the notion of an online vs. offline
algorithm. For offline fits, a batch transform (non-subset-invariant) is
preferred. For a transformer that can only be used in an online sense, or is
primarily used that way, keep the invariant.

I see 3 options here - all I can say is that I don't vote for the first
+ No transform method on the manifold learners, so no cross-validation
This is what I thought we usually do. It looks like you said we are doing a
greedy transform. I'm not sure I follow that. In particular for spectral
embedding for example there is a pretty way to describe the transform and
that's what we're doing. You could also look at doing transductive learning
but that's not really the standard formulation, is it?

+ Pointwise, distributable, subset-invariant, suboptimal greedy transform
+ Non-distributable, non-subset-invariant, optimal batch transform
Can you give an example of that?
-Charles
On Mon., Jan. 20, 21:24:52 2020, joel.nothman at gmail.com wrote:
I think allowing subset invariance to not hold is making stronger
assumptions than we usually do about what it means to have a "test set".
Having a transformation like this that relies on test set statistics
implies that the test set is more than just selected samples, but rather
that a large collection of samples is available at one time, and that it is
in some sense sufficient or complete (no more samples are available that
would give a better fit). So in a predictive modelling context you might
have to set up your cross validation splits with this in mind.

In terms of API, the subset invariance constraint allows us to assume that
the transformation can be distributed or parallelized over samples. I'm not
sure whether we have exploited that assumption within scikit-learn or
whether related projects do so.
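
A minimal sketch of that point, using NumPy only rather than any scikit-learn
estimator. The two functions `pointwise_transform` and `batch_transform` are
hypothetical stand-ins: the first uses only statistics stored at fit time, the
second recomputes its statistics from whatever batch it is given. Only the
first can be split over samples and reassembled without changing the result.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
X_test = rng.normal(loc=1.0, size=(8, 3))

# "Fit" step: the only stored statistic comes from the training data.
train_mean = X_train.mean(axis=0)


def pointwise_transform(X):
    """Center with the stored training mean; every row is handled independently."""
    return X - train_mean


def batch_transform(X):
    """Center with the mean of the batch itself, i.e. with test-set statistics."""
    return X - X.mean(axis=0)


# Split the test set into chunks, transform each chunk, and stack the results.
chunks = np.array_split(X_test, 3)

chunked_pointwise = np.vstack([pointwise_transform(c) for c in chunks])
pointwise_is_invariant = np.allclose(chunked_pointwise, pointwise_transform(X_test))

# The batch version gives different rows, because each chunk has its own mean.
chunked_batch = np.vstack([batch_transform(c) for c in chunks])
batch_is_invariant = np.allclose(chunked_batch, batch_transform(X_test))

print(pointwise_is_invariant, batch_is_invariant)
```

The chunked comparison is the same idea as T.transform(X)[i] ==
T.transform(X[i:i+1]), applied to chunks of rows instead of single rows.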

I see the benefit of using such transformations in a prediction Pipeline,
and really appreciate this challenge to our assumptions of what "transform"
means.

Joel

On Tue., 21 Jan. 2020, 11:50 am Charles Pehlivanian,
<pehlivaniancharles at gmail.com> wrote:

> Not all data transformers have a transform method. For those that do,
> subset invariance is assumed as expressed
> in check_methods_subset_invariance(). It must be the case that
> T.transform(X)[i] == T.transform(X[i:i+1]), e.g. This is true for classic
> projections - PCA, kernel PCA, etc., but not for some manifold learning
> transformers - MDS, SpectralEmbedding, etc. For those, an optimal placement
> of the data in space is a constrained optimization, may take into account
> the centroid of the dataset etc.
>
> The manifold learners have "batch" oos transform() methods that aren't
> implemented, and wouldn't pass that test. Instead, those that do -
> LocallyLinearEmbedding - use a pointwise version, essentially replacing a
> batch fit with a suboptimal greedy one [for LocallyLinearEmbedding]:
>
>     for i in range(X.shape[0]):
>         X_new[i] = np.dot(self.embedding_[ind[i]].T, weights[i])
>
> Where to implement the batch transform() methods for MDS,
> SpectralEmbedding, LocallyLinearEmbedding, etc?
>
> Another verb? Both batch and pointwise versions? The latter is easy to
> implement once the batch version exists. Relax the test conditions?
> transform() is necessary for oos testing, so necessary for cross
> validation. The batch versions should be preferred, although as it stands,
> the pointwise versions are.
>
> Thanks
> Charles Pehlivanian
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
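
As an illustration of why the pointwise loop quoted above is subset-invariant,
here is a small NumPy sketch. It is not the scikit-learn implementation: the
arrays `embedding`, `ind` and `weights` are random placeholders standing in for
the fitted embedding, the neighbour indices and the reconstruction weights.
Each output row depends only on that row's own neighbours and weights, so the
loop can be written as a single vectorized contraction and any single row can
be computed on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_new, n_neighbors, n_components = 20, 5, 4, 2

# Placeholder for the fitted embedding (self.embedding_ in the quoted loop).
embedding = rng.normal(size=(n_train, n_components))

# Placeholder neighbour indices: n_neighbors distinct training points per new point.
ind = np.argsort(rng.random((n_new, n_train)), axis=1)[:, :n_neighbors]

# Placeholder weights, normalized so each row sums to one.
weights = rng.random((n_new, n_neighbors))
weights /= weights.sum(axis=1, keepdims=True)

# The pointwise loop from the quoted message.
X_new = np.empty((n_new, n_components))
for i in range(n_new):
    X_new[i] = np.dot(embedding[ind[i]].T, weights[i])

# Same computation in one call: sum over the neighbour axis k for each new point n.
X_vectorized = np.einsum("nkd,nk->nd", embedding[ind], weights)

# Transforming one point alone gives the same row as transforming all of them.
single_row = np.dot(embedding[ind[2]].T, weights[2])

print(np.allclose(X_new, X_vectorized), np.allclose(single_row, X_new[2]))
```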


_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn

