GitHub user tengpeng opened a pull request:

    https://github.com/apache/spark/pull/21522

    [SPARK-24467][ML] VectorAssemblerEstimator

    Background: See the JIRA ticket.
    
    This PR is at a very early stage; hopefully it will help us decide on the 
right direction.
    
    ## What changes were proposed in this pull request? 
    
    1. Add an optional Param to VectorAssembler for specifying the sizes of 
Vectors in the inputCols. 
    - If not given, VectorAssembler behaves as it does now. 
    - If given, VectorAssembler can use that info instead of figuring out 
the Vector sizes via metadata or by examining Rows in the data. It also 
performs consistency checks.
    2. Add a VectorAssemblerEstimator which gets the Vector lengths from data 
and produces a VectorAssembler_Model_ with the vector lengths Param specified.
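
    The two changes above can be sketched in plain Python. This is only an 
illustration of the intended behaviour, not the code in this PR and not the 
Spark API: the names `infer_sizes`, `assemble`, and `sizes` are hypothetical, 
rows are modelled as dicts of lists, and the sizes are read from the first row 
only, which is an assumption of the sketch.

```python
from typing import Dict, List, Optional, Sequence

# A row maps a column name to a vector, modelled here as a list of floats.
Row = Dict[str, List[float]]


def infer_sizes(rows: Sequence[Row], input_cols: Sequence[str]) -> List[int]:
    """Estimator-like step: read the vector lengths from the first row."""
    if not rows:
        raise ValueError("cannot infer vector sizes from empty data")
    first = rows[0]
    return [len(first[col]) for col in input_cols]


def assemble(row: Row, input_cols: Sequence[str],
             sizes: Optional[Sequence[int]] = None) -> List[float]:
    """Concatenate the input vectors; check their lengths if sizes are given."""
    if sizes is not None and len(sizes) != len(input_cols):
        raise ValueError("need exactly one size per input column")
    out: List[float] = []
    for i, col in enumerate(input_cols):
        vec = row[col]
        # Consistency check: only performed when the sizes were specified.
        if sizes is not None and len(vec) != sizes[i]:
            raise ValueError(
                f"column {col!r}: expected length {sizes[i]}, got {len(vec)}")
        out.extend(vec)
    return out


rows = [
    {"a": [1.0, 2.0], "b": [3.0]},
    {"a": [4.0, 5.0], "b": [6.0]},
]
cols = ["a", "b"]
sizes = infer_sizes(rows, cols)                       # "fit": learn lengths once
assembled = [assemble(r, cols, sizes) for r in rows]  # "transform" with checks
```

    Without `sizes`, `assemble` simply concatenates whatever it finds, which 
mirrors the "behaves as it does now" case; with `sizes`, a row whose vector 
has an unexpected length is rejected instead of silently producing an output 
of a different width.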
    
    Todos:
    1. Reduce code duplication. Not sure whether we want a trait that reduces 
duplication between `VectorAssembler` and `VectorAssemblerEstimator`, like 
`OneHotEncoderBase`.
    2. Add comments, documentation, etc.
    
    
    ## How was this patch tested?
    Added unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tengpeng/spark Spark-24467

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21522.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21522
    
----
commit 8e3aa44c3937d60d5aa35dd03604e57ef218ebb4
Author: Teng Peng <josephtengpeng@...>
Date:   2018-06-09T12:48:30Z

    Add a param to VectorAssembler for specifying the sizes of Vectors. Add a 
VectorAssemblerEstimator.

----


---

