[ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506334#comment-16506334 ]
Nick Pentreath edited comment on SPARK-24467 at 6/8/18 5:59 PM:
----------------------------------------------------------------

Yeah, the estimator would return a {{Model}} from {{fit}}, right? So I don't think a new estimator could return the existing {{VectorAssembler}}; it would probably need to return a new {{VectorAssemblerModel}}. Though perhaps the existing one can be made a Model without breaking things.

was (Author: mlnick):
Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't think a new estimator could return the existing {{VectorAssembler}} but would probably need to return a new {{VectorAssemblerModel}}

> VectorAssemblerEstimator
> ------------------------
>
>                 Key: SPARK-24467
>                 URL: https://issues.apache.org/jira/browse/SPARK-24467
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.4.0
>            Reporter: Joseph K. Bradley
>            Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since
> I thought the latter option would break most workflows. However, I should
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the
> inputCols. This Param can be optional. If not given, then VectorAssembler
> will behave as it does now. If given, then VectorAssembler can use that info
> instead of figuring out the Vector sizes via metadata or examining Rows in
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows. Migrating to
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other
> things in the future which require vector length metadata, so we could
> consider keeping it rather than deprecating it.
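
For readers skimming the thread, the two-part proposal in the description can be sketched without Spark at all. The following is only a toy illustration in plain Python: `ToyVectorAssembler`, `ToyVectorAssemblerEstimator`, `vector_sizes` and `output_size` are hypothetical names made up for this sketch and are not Spark APIs, and rows are modeled as plain dicts of lists rather than a DataFrame. It also sidesteps the question raised in the comment above about whether `fit` must return a `Model` subclass.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence

# Toy stand-in for a DataFrame row: column name -> vector values.
Row = Dict[str, List[float]]


@dataclass
class ToyVectorAssembler:
    """Toy transformer that concatenates the input columns of each row."""
    input_cols: Sequence[str]
    # Hypothetical optional Param from the proposal: expected size per column.
    # When None, the assembler behaves as before and knows no sizes up front.
    vector_sizes: Optional[Dict[str, int]] = None

    def output_size(self) -> Optional[int]:
        """Output length, known without looking at data only if sizes are set."""
        if self.vector_sizes is None:
            return None
        return sum(self.vector_sizes[c] for c in self.input_cols)

    def transform(self, rows: List[Row]) -> List[List[float]]:
        assembled_rows = []
        for row in rows:
            assembled: List[float] = []
            for col in self.input_cols:
                vec = row[col]
                # Consistency check, only performed when sizes were supplied.
                if self.vector_sizes is not None and len(vec) != self.vector_sizes[col]:
                    raise ValueError(
                        f"column {col!r}: expected size "
                        f"{self.vector_sizes[col]}, got {len(vec)}"
                    )
                assembled.extend(vec)
            assembled_rows.append(assembled)
        return assembled_rows


@dataclass
class ToyVectorAssemblerEstimator:
    """Toy estimator: learns vector sizes from data, returns a sized assembler."""
    input_cols: Sequence[str]

    def fit(self, rows: List[Row]) -> ToyVectorAssembler:
        if not rows:
            raise ValueError("cannot infer vector sizes from an empty dataset")
        # Assumption for this sketch: the first row is representative.
        first = rows[0]
        sizes = {col: len(first[col]) for col in self.input_cols}
        return ToyVectorAssembler(self.input_cols, sizes)


rows = [{"a": [1.0, 2.0], "b": [3.0]}, {"a": [4.0, 5.0], "b": [6.0]}]
assembler = ToyVectorAssemblerEstimator(["a", "b"]).fit(rows)
print(assembler.vector_sizes)     # {'a': 2, 'b': 1}
print(assembler.transform(rows))  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

The point of the sketch is the migration story: a user who only swaps in the estimator never has to type vector lengths by hand, while an assembler constructed without `vector_sizes` keeps its current behavior.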