GitHub user tengpeng opened a pull request:
https://github.com/apache/spark/pull/21522
[SPARK-24467][ML] VectorAssemblerEstimator
Background: See the JIRA ticket.
This PR is at a very early stage; hopefully it will help us decide on the
right direction.
## What changes were proposed in this pull request?
1. Add an optional Param to VectorAssembler for specifying the sizes of
Vectors in the inputCols.
- If not given, VectorAssembler behaves as it does now.
- If given, VectorAssembler can use that info instead of figuring out
the Vector sizes via metadata or by examining Rows in the data. It also
performs consistency checks.
2. Add a VectorAssemblerEstimator which gets the Vector lengths from data
and produces a VectorAssembler_Model_ with the vector lengths Param specified.
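
To make the two proposed changes concrete, here is a minimal conceptual sketch in plain Python. It is not the Spark API and not the code in this PR: the class names `VectorAssemblerSketch` and `VectorAssemblerEstimatorSketch`, the `sizes` argument, and the dict-based rows are all hypothetical stand-ins. It also assumes the estimator reads the lengths from the first row only, which the description above does not specify.

```python
from typing import Dict, List, Optional, Sequence

# A row maps a column name to a vector (hypothetical stand-in for a Spark Row).
Row = Dict[str, Sequence[float]]


class VectorAssemblerSketch:
    """Hypothetical assembler with an optional per-column sizes parameter."""

    def __init__(self, input_cols: List[str], sizes: Optional[List[int]] = None):
        if sizes is not None and len(sizes) != len(input_cols):
            raise ValueError("sizes must have one entry per input column")
        self.input_cols = input_cols
        self.sizes = sizes  # None means: behave as before, no size checks

    def transform(self, rows: List[Row]) -> List[List[float]]:
        assembled_rows = []
        for row in rows:
            assembled: List[float] = []
            for i, col in enumerate(self.input_cols):
                vec = list(row[col])
                # Consistency check, performed only when sizes were given.
                if self.sizes is not None and len(vec) != self.sizes[i]:
                    raise ValueError(
                        f"column {col!r}: expected size {self.sizes[i]}, got {len(vec)}"
                    )
                assembled.extend(vec)
            assembled_rows.append(assembled)
        return assembled_rows


class VectorAssemblerEstimatorSketch:
    """Hypothetical estimator that infers vector sizes from the data."""

    def __init__(self, input_cols: List[str]):
        self.input_cols = input_cols

    def fit(self, rows: List[Row]) -> VectorAssemblerSketch:
        if not rows:
            raise ValueError("cannot infer vector sizes from an empty dataset")
        # Assumption: the first row is representative of every row.
        first = rows[0]
        sizes = [len(first[col]) for col in self.input_cols]
        return VectorAssemblerSketch(self.input_cols, sizes)
```

In this sketch, `fit` returns an assembler with the sizes already filled in, so a later `transform` on data with mismatched vector lengths fails fast instead of silently producing output of a different width.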
Todos:
1. Reduce code duplication. Not sure whether we want a trait that reduces
duplication between `VectorAssembler` and `VectorAssemblerEstimator`, like
`OneHotEncoderBase`.
2. Add comments, documentation, etc.
## How was this patch tested?
Added unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tengpeng/spark Spark-24467
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21522.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21522
----
commit 8e3aa44c3937d60d5aa35dd03604e57ef218ebb4
Author: Teng Peng <josephtengpeng@...>
Date: 2018-06-09T12:48:30Z
Add a param to VectorAssembler for specifying the sizes of Vectors. Add a
VectorAssemblerEstimator.
----