[ https://issues.apache.org/jira/browse/SPARK-23805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-23805:
------------------------------------

    Assignee:     (was: Apache Spark)

> support vector-size validation and Inference
> --------------------------------------------
>
>                 Key: SPARK-23805
>                 URL: https://issues.apache.org/jira/browse/SPARK-23805
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.4.0
>            Reporter: zhengruifeng
>            Priority: Major
>
> I think it may be meaningful to unify the usage of {{AttributeGroup}} and
> support vector-size validation and inference in the algorithms.
> My thoughts are:
> * In {{transformSchema}}, validate the input vector-size if possible. If
> the input vector-size can be obtained from the schema, check it.
> ** Suppose a {{PCA}} estimator with k=4: {{transformSchema}} will
> require the vector-size to be no less than 4.
> ** Suppose a {{PCAModel}} trained with vectors of length 10:
> {{transformSchema}} will require the vector-size to be 10.
> * In {{transformSchema}}, infer the output vector-size if possible.
> ** Suppose a {{PCA}} estimator with k=4: {{transformSchema}} will
> return a schema with output vector-size=4.
> ** Suppose a {{PCAModel}} trained with k=4: {{transformSchema}} will
> return a schema with output vector-size=4.
> * In {{transform}}, infer the output vector-size if possible.
> * In {{fit}}, obtain the input vector-size from the schema if possible. This
> can help eliminate redundant {{first}} jobs.
>
> The current PR only modifies {{PCA}} and {{MaxAbsScaler}} to illustrate my
> idea. Since the validation and inference are quite algorithm-specific, we may
> need to separate the task into several small subtasks.
> What do you think about this? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick]
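
The validation and inference rules proposed in the description can be sketched roughly as follows. This is plain Python, not Spark code. The schema is modelled as a dict from column name to vector size (None when the size is unknown), and every function name here is hypothetical, chosen only for illustration. It assumes that a PCA estimator needs at least k input features, and that a fitted model needs exactly the size it was trained on.

```python
from typing import Dict, Optional

# Hypothetical schema model: column name -> vector size, None if unknown.
Schema = Dict[str, Optional[int]]


def pca_transform_schema(schema: Schema, input_col: str,
                         output_col: str, k: int) -> Schema:
    """Estimator-side sketch: validate the input size, infer the output size."""
    size = schema.get(input_col)
    # Validate only when the size is actually recorded in the schema.
    if size is not None and size < k:
        raise ValueError(
            f"input vector size {size} must be no less than k={k}")
    out = dict(schema)
    out[output_col] = k  # the output size is always k for PCA
    return out


def pca_model_transform_schema(schema: Schema, input_col: str,
                               output_col: str, k: int,
                               trained_size: int) -> Schema:
    """Model-side sketch: the input must match the size seen during fit."""
    size = schema.get(input_col)
    if size is not None and size != trained_size:
        raise ValueError(
            f"input vector size {size} must be {trained_size}")
    out = dict(schema)
    out[output_col] = k
    return out


def input_size_for_fit(schema: Schema, input_col: str) -> Optional[int]:
    """In fit, read the size from the schema when it is known.

    A None result means the caller still has to look at the data,
    for example by fetching the first row.
    """
    return schema.get(input_col)
```

With this model, a schema of {"features": 10} passes the estimator check for k=4 and gains an output column of size 4, while {"features": 3} is rejected before any data is touched.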