[
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-24467:
------------------------------------
Assignee: Apache Spark
> VectorAssemblerEstimator
> ------------------------
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 2.4.0
> Reporter: Joseph K. Bradley
> Assignee: Apache Spark
> Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since
> I thought the latter option would break most workflows. However, I should
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the
> inputCols. This Param can be optional. If not given, then VectorAssembler
> will behave as it does now. If given, then VectorAssembler can use that info
> instead of figuring out the Vector sizes via metadata or examining Rows in
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows. Migrating to
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other
> things in the future which require vector length metadata, so we could
> consider keeping it rather than deprecating it.
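
For illustration only, the two-part proposal above might be sketched in plain Python as follows. This is not Spark code and not an API from the issue: the class names SketchVectorAssembler and SketchVectorAssemblerEstimator, the input_sizes field, and the choice to read sizes from the first row are all assumptions made for the sketch. Rows are modelled as plain dicts of lists rather than DataFrame rows.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SketchVectorAssembler:
    """Hypothetical stand-in for VectorAssembler with an optional sizes Param."""
    input_cols: List[str]
    output_col: str
    # Hypothetical optional Param: expected length of each input vector.
    # When None, the assembler behaves as before and knows nothing up front.
    input_sizes: Optional[Dict[str, int]] = None

    def output_size(self) -> Optional[int]:
        """Output length is known without touching data only if sizes are set."""
        if self.input_sizes is None:
            return None
        return sum(self.input_sizes[c] for c in self.input_cols)

    def transform(self, rows: List[dict]) -> List[dict]:
        out = []
        for row in rows:
            assembled: List[float] = []
            for col in self.input_cols:
                vec = list(row[col])
                # Consistency check, only possible when sizes were given.
                if self.input_sizes is not None and len(vec) != self.input_sizes[col]:
                    raise ValueError(
                        f"column {col!r}: expected size "
                        f"{self.input_sizes[col]}, got {len(vec)}"
                    )
                assembled.extend(vec)
            out.append({**row, self.output_col: assembled})
        return out


@dataclass
class SketchVectorAssemblerEstimator:
    """Hypothetical Estimator: learns vector sizes from data, returns an assembler."""
    input_cols: List[str]
    output_col: str

    def fit(self, rows: List[dict]) -> SketchVectorAssembler:
        if not rows:
            raise ValueError("cannot infer vector sizes from an empty dataset")
        # Assumption for this sketch: the first row is representative.
        first = rows[0]
        sizes = {c: len(first[c]) for c in self.input_cols}
        return SketchVectorAssembler(self.input_cols, self.output_col, sizes)


rows = [
    {"a": [1.0, 2.0], "b": [3.0]},
    {"a": [4.0, 5.0], "b": [6.0]},
]
assembler = SketchVectorAssemblerEstimator(["a", "b"], "features").fit(rows)
result = assembler.transform(rows)
```

In this sketch the user never types a vector length: fit() records the sizes and the returned assembler carries them, while an assembler constructed without input_sizes keeps the existing behavior.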