[ https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251324#comment-15251324 ]
yuhao yang commented on SPARK-14760:
------------------------------------

[~holdenkarau] also shared some thoughts in the PR. In my opinion, the design needs to cover two cases:

1. Pipeline scenario: we can use transformSchema to run validation before the actual transform happens. This is especially helpful and efficient in a pipeline.
2. A transformer used independently: in this case transformSchema mainly provides validation. In fact, some transformers already call transformSchema inside transform to get the output schema, such as HashingTF, Binarizer, and ChiSqSelectorModel.

From a design perspective, transformSchema should cover both validation (including friendly error handling) and schema transformation, so that transform/fit can trust that the dataset meets certain assumptions. That is why this JIRA was created.

> Feature transformers should always invoke transformSchema in transform or fit
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-14760
>                 URL: https://issues.apache.org/jira/browse/SPARK-14760
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: yuhao yang
>            Priority: Minor
>
> Since one of the primary functions of transformSchema is to conduct parameter
> validation, transformers should always invoke transformSchema in transform
> and fit.
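To make the two cases concrete, here is a minimal sketch of the pattern in plain Python. It is not Spark code: `ToyBinarizer`, `validate_pipeline`, the dict-based schema, and the list-of-dicts dataset are all hypothetical stand-ins chosen for illustration. It only shows the idea that transform calls the schema check first, and that a pipeline can chain the same check across stages before touching any data.

```python
from typing import Dict, List, Tuple

# Toy schema: column name -> type name. A stand-in for a real schema type.
Schema = Dict[str, str]
Rows = List[Dict[str, float]]


class ToyBinarizer:
    """Hypothetical transformer used only to illustrate the pattern."""

    def __init__(self, input_col: str, output_col: str, threshold: float = 0.0):
        self.input_col = input_col
        self.output_col = output_col
        self.threshold = threshold

    def transform_schema(self, schema: Schema) -> Schema:
        # Validation part: fail early, with a readable message.
        if self.input_col not in schema:
            raise ValueError(f"Input column '{self.input_col}' does not exist.")
        if schema[self.input_col] != "double":
            raise ValueError(
                f"Column '{self.input_col}' must be double, "
                f"got {schema[self.input_col]}."
            )
        if self.output_col in schema:
            raise ValueError(f"Output column '{self.output_col}' already exists.")
        # Schema-transform part: return a new schema with the output column.
        out = dict(schema)
        out[self.output_col] = "double"
        return out

    def transform(self, schema: Schema, rows: Rows) -> Tuple[Schema, Rows]:
        # Always validate first, so the code below can trust its input.
        out_schema = self.transform_schema(schema)
        out_rows = [
            {**row, self.output_col: 1.0 if row[self.input_col] > self.threshold else 0.0}
            for row in rows
        ]
        return out_schema, out_rows


def validate_pipeline(stages: List[ToyBinarizer], schema: Schema) -> Schema:
    # Pipeline case: thread the schema through every stage without any data.
    for stage in stages:
        schema = stage.transform_schema(schema)
    return schema
```

In this sketch, calling `ToyBinarizer("x", "x_bin").transform(...)` on a dataset without column `x` raises a `ValueError` before any row is processed, and `validate_pipeline` reports the same error for a misconfigured stage without running the earlier stages on data.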