[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506334#comment-16506334
 ] 

Nick Pentreath edited comment on SPARK-24467 at 6/8/18 5:59 PM:
----------------------------------------------------------------

Yeah, the estimator would return a {{Model}} from {{fit}}, right? So I don't 
think a new estimator could return the existing {{VectorAssembler}}; it would 
probably need to return a new {{VectorAssemblerModel}}. Though perhaps the 
existing one can be made a {{Model}} without breaking things.
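
To make the typing concern concrete, here is a minimal sketch in plain Python. The classes are toy stand-ins that mirror the Transformer / Model / Estimator relationship in Spark ML; they are not the pyspark classes, and {{VectorAssemblerModel}} is hypothetical.

```python
# Toy stand-ins for the Spark ML hierarchy; NOT the real pyspark classes.
class Transformer:
    def transform(self, rows):
        raise NotImplementedError


class Model(Transformer):
    """A Transformer that was produced by fitting an Estimator."""


class Estimator:
    def fit(self, rows) -> Model:
        # The contract under discussion: fit() must hand back a Model.
        raise NotImplementedError


class VectorAssembler(Transformer):
    """Mirrors today's assembler: a plain Transformer, not a Model."""


class VectorAssemblerModel(Model):
    """Hypothetical new class that an estimator's fit() could return."""


# The existing assembler does not satisfy the return type of fit() ...
print(issubclass(VectorAssembler, Model))        # False
# ... whereas a dedicated model class would.
print(issubclass(VectorAssemblerModel, Model))   # True
```

Making the existing class a {{Model}} would amount to changing its base class in this sketch; whether that can be done without breaking callers is the open question.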


was (Author: mlnick):
Yeah the estimator would return a {{Model}} from {{fit}} right? So I don't 
think a new estimator could return the existing {{VectorAssembler}} but would 
probably need to return a new {{VectorAssemblerModel}}

> VectorAssemblerEstimator
> ------------------------
>
>                 Key: SPARK-24467
>                 URL: https://issues.apache.org/jira/browse/SPARK-24467
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.4.0
>            Reporter: Joseph K. Bradley
>            Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
>
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.
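
The two bullets in the description can be sketched roughly as follows. This is a plain-Python illustration only, with vectors modelled as lists and rows as dicts; it is not Spark code. The {{vector_sizes}} argument stands in for the proposed optional Param, and the name, the constructor signatures, and the choice to read sizes from the first row are all assumptions made for the example. Whether {{fit}} can really return this type is the question raised in the comment above.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, List[float]]


class VectorAssembler:
    """Toy assembler; `vector_sizes` is a hypothetical optional Param."""

    def __init__(self, input_cols: Sequence[str],
                 vector_sizes: Optional[Dict[str, int]] = None):
        self.input_cols = list(input_cols)
        self.vector_sizes = vector_sizes

    def transform(self, rows: List[Row]) -> List[List[float]]:
        out = []
        for row in rows:
            assembled: List[float] = []
            for col in self.input_cols:
                vec = row[col]
                # Consistency check, only when sizes were supplied.
                if (self.vector_sizes is not None
                        and len(vec) != self.vector_sizes[col]):
                    raise ValueError(
                        f"column {col!r}: expected size "
                        f"{self.vector_sizes[col]}, got {len(vec)}")
                assembled.extend(vec)
            out.append(assembled)
        return out


class VectorAssemblerEstimator:
    """Toy estimator: learns vector sizes from data, then configures the assembler."""

    def __init__(self, input_cols: Sequence[str]):
        self.input_cols = list(input_cols)

    def fit(self, rows: List[Row]) -> VectorAssembler:
        if not rows:
            raise ValueError("cannot infer vector sizes from empty data")
        first = rows[0]  # assumption: the first row is representative
        sizes = {col: len(first[col]) for col in self.input_cols}
        return VectorAssembler(self.input_cols, vector_sizes=sizes)


rows = [{"a": [1.0, 2.0], "b": [3.0]},
        {"a": [4.0, 5.0], "b": [6.0]}]
assembler = VectorAssemblerEstimator(["a", "b"]).fit(rows)
print(assembler.vector_sizes)      # {'a': 2, 'b': 1}
print(assembler.transform(rows))   # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

Constructing the assembler without {{vector_sizes}} keeps the current behaviour in this sketch, which is the sense in which existing workflows are not broken.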



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
