[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251391#comment-15251391
 ] 

Nick Pentreath commented on SPARK-14760:
----------------------------------------

In general, given the name {{transformSchema}}, one would expect the method to 
actually transform the input schema into the output schema. It does, but only 
a few transformers seem to use the output schema it returns. Hence, in most 
transformers the output schema computed by {{transformSchema}} is not 
actually enforced in {{fit}} or {{transform}}.
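
To make the distinction concrete, here is a minimal sketch in plain Python. It is a toy model, not the Spark API: schemas are dicts mapping column name to type name, and {{ToyBinarizer}}, {{transform_schema}} and {{transform}} are hypothetical stand-ins for the corresponding Spark ML concepts. The transformer derives its output schema first and then checks what it produced against it, so the schema computed up front is the one that is enforced:

```python
class ToyBinarizer:
    """Toy stand-in for a feature transformer; not the Spark ML class."""

    def __init__(self, input_col, output_col, threshold=0.0):
        self.input_col = input_col
        self.output_col = output_col
        self.threshold = threshold

    def transform_schema(self, schema):
        # Validate parameters and the input schema, then derive the output schema.
        if self.input_col not in schema:
            raise ValueError(f"missing input column {self.input_col!r}")
        if schema[self.input_col] != "double":
            raise ValueError(f"column {self.input_col!r} must be double")
        if self.output_col in schema:
            raise ValueError(f"output column {self.output_col!r} already exists")
        return {**schema, self.output_col: "double"}

    def transform(self, schema, rows):
        # Call transform_schema first, so validation always runs here too,
        # and keep its result as the declared output schema.
        output_schema = self.transform_schema(schema)
        out_rows = [
            {**row, self.output_col: 1.0 if row[self.input_col] > self.threshold else 0.0}
            for row in rows
        ]
        # Enforce the declared output schema on what was actually produced.
        for row in out_rows:
            if set(row) != set(output_schema):
                raise RuntimeError("produced columns differ from declared schema")
        return output_schema, out_rows
```

With this shape, a bad input schema fails in {{transform}} with the same error it would raise during upfront validation.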

So in a Pipeline, you can call {{transformSchema}} for each stage, which 
performs validation upfront, but if the individual transformers don't enforce 
the output schema returned, you can have a situation where the schema 
validation succeeds but a pipeline stage then does something different at 
runtime and breaks the pipeline.
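
The failure mode can be sketched with a toy pipeline in plain Python. Everything here is hypothetical and only mimics the shape of the problem: schemas are dicts of column name to type name, and the stage classes and the {{validate}} and {{run}} helpers are illustrative, not Spark classes or methods.

```python
class DeclaredButNotEnforced:
    """Hypothetical stage: declares output column 'features' but writes 'feats'."""

    def transform_schema(self, schema):
        return {**schema, "features": "vector"}

    def transform(self, schema, rows):
        # Bug: never calls transform_schema, and writes a different column.
        return {**schema, "feats": "vector"}, [{**r, "feats": [r["x"]]} for r in rows]


class NeedsFeatures:
    """Hypothetical downstream stage that requires a 'features' column."""

    def transform_schema(self, schema):
        if "features" not in schema:
            raise ValueError("missing input column 'features'")
        return {**schema, "prediction": "double"}

    def transform(self, schema, rows):
        # Validates against the schema the previous stage really produced.
        output_schema = self.transform_schema(schema)
        out_rows = [{**r, "prediction": float(len(r["features"]))} for r in rows]
        return output_schema, out_rows


def validate(stages, schema):
    # Upfront check: chain the declared schemas without touching any data.
    for stage in stages:
        schema = stage.transform_schema(schema)
    return schema


def run(stages, schema, rows):
    # Actual execution: each stage sees what the previous one really produced.
    for stage in stages:
        schema, rows = stage.transform(schema, rows)
    return schema, rows
```

For the stage list {{[DeclaredButNotEnforced(), NeedsFeatures()]}} and input schema {{{"x": "double"}}}, {{validate}} succeeds, while {{run}} raises a {{ValueError}} at the second stage because the first stage did not honour the schema it declared.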

IMO the approach used by, for example, {{HashingTF}} and {{Binarizer}} is 
correct, and other transformers should do the same, no?

> Feature transformers should always invoke transformSchema in transform or fit
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-14760
>                 URL: https://issues.apache.org/jira/browse/SPARK-14760
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: yuhao yang
>            Priority: Minor
>
> Since one of the primary functions of transformSchema is to conduct parameter 
> validation, transformers should always invoke transformSchema in transform 
> and fit.



