[
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251391#comment-15251391
]
Nick Pentreath commented on SPARK-14760:
----------------------------------------
In general, given the name {{transformSchema}}, one would expect the method to
actually transform the input schema into the output schema. It does, but only a
few transformers seem to use the output schema it returns. Hence, the output
schema computed in {{transformSchema}} is not actually enforced in {{fit}} or
{{transform}}.
So in a Pipeline, you can call {{transformSchema}} on each stage to perform
validation up front. But if the individual transformers don't enforce the
output schema they returned, schema validation can succeed while a pipeline
stage produces a different schema at run time and breaks the pipeline.
IMO the approach used by those examples ({{HashingTF}}, {{Binarizer}}) is
correct, and other transformers should do the same, no?
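
To make the gap concrete, here is a rough sketch in plain Python. It is a toy
model and not the Spark API: {{Frame}}, {{UncheckedBinarizer}} and
{{SchemaDrivenBinarizer}} are made-up names, and a schema is assumed to be a
simple column-name-to-type-name dict. The first stage never consults
{{transform_schema}} inside {{transform}}, so its declared and actual output
schemas can drift apart. The second derives its output from the schema that
{{transform_schema}} returns, which also runs the input validation as a side
effect.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: a schema is modelled as column name -> type name.
Schema = Dict[str, str]


@dataclass
class Frame:
    """Toy dataset (hypothetical, not a Spark DataFrame)."""
    schema: Schema
    rows: List[dict]


class UncheckedBinarizer:
    """Declares a 'double' output column but never uses transform_schema in transform."""
    input_col = "x"
    output_col = "x_bin"

    def transform_schema(self, schema: Schema) -> Schema:
        # Validate the input column, then derive the output schema.
        if schema.get(self.input_col) != "double":
            raise ValueError(f"column {self.input_col!r} must be of type double")
        return {**schema, self.output_col: "double"}

    def transform(self, frame: Frame) -> Frame:
        # Nothing ties this method to transform_schema: it skips validation
        # and emits strings although 'double' was declared.
        rows = [{**r, self.output_col: str(r[self.input_col] > 0.5)} for r in frame.rows]
        return Frame({**frame.schema, self.output_col: "string"}, rows)


class SchemaDrivenBinarizer(UncheckedBinarizer):
    """Calls transform_schema first and builds its output from the returned schema."""

    def transform(self, frame: Frame) -> Frame:
        out_schema = self.transform_schema(frame.schema)  # validates, then yields the output schema
        rows = [
            {**r, self.output_col: 1.0 if r[self.input_col] > 0.5 else 0.0}
            for r in frame.rows
        ]
        return Frame(out_schema, rows)


frame = Frame({"x": "double"}, [{"x": 0.2}, {"x": 0.9}])
declared = UncheckedBinarizer().transform_schema(frame.schema)
unchecked_out = UncheckedBinarizer().transform(frame)
checked_out = SchemaDrivenBinarizer().transform(frame)
print(unchecked_out.schema == declared, checked_out.schema == declared)
```

In this sketch, up-front validation passes for both stages, but only the
schema-driven one is guaranteed to produce what was validated.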
> Feature transformers should always invoke transformSchema in transform or fit
> -----------------------------------------------------------------------------
>
> Key: SPARK-14760
> URL: https://issues.apache.org/jira/browse/SPARK-14760
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: yuhao yang
> Priority: Minor
>
> Since one of the primary functions of transformSchema is to conduct parameter
> validation, transformers should always invoke transformSchema in transform
> and fit.