Liang-Chi Hsieh created SPARK-20542:
---------------------------------------

             Summary: Add an API into Bucketizer that can bin a lot of columns 
all at once
                 Key: SPARK-20542
                 URL: https://issues.apache.org/jira/browse/SPARK-20542
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 2.2.0
            Reporter: Liang-Chi Hsieh


ML's Bucketizer can currently bin only a single column of continuous features. 
If a dataset has thousands of continuous columns that need binning, we end up 
with thousands of ML stages, which is very inefficient for query planning and 
execution.

We should have a type of bucketizer that can bin many columns all at once. 
It would need to accept a list of arrays of split points, one array per 
column to bin, but it could make things more efficient by replacing thousands 
of stages with just one.
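
To make the proposed input shape concrete, here is a minimal sketch in plain Python, not Spark code and not the eventual API. The names `bucket_index` and `bin_columns` are hypothetical, as is the row-as-dict representation. It assumes each splits array is sorted, that buckets are half-open [lower, upper) with the last bucket also including its upper bound, and that values outside the splits are rejected.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def bucket_index(value: float, splits: Sequence[float]) -> int:
    """Return i such that splits[i] <= value < splits[i + 1].

    Assumption: the last bucket also includes its upper bound, and
    values outside [splits[0], splits[-1]] are treated as errors.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"value {value} is outside the split range")
    if value == splits[-1]:
        return len(splits) - 2
    return bisect_right(splits, value) - 1


def bin_columns(
    rows: List[Dict[str, float]],
    columns: Sequence[str],
    splits_list: Sequence[Sequence[float]],
) -> List[List[int]]:
    """Bin every listed column in a single pass over the rows.

    splits_list[i] holds the split points for columns[i], which is the
    "list of arrays of split points" shape described above.
    """
    if len(columns) != len(splits_list):
        raise ValueError("need exactly one splits array per column")
    return [
        [bucket_index(row[col], splits) for col, splits in zip(columns, splits_list)]
        for row in rows
    ]


# Hypothetical data: two continuous columns, each with its own splits.
rows = [
    {"age": 5.0, "income": 20.0},
    {"age": 42.0, "income": 75.0},
]
columns = ["age", "income"]
splits_list = [
    [0.0, 18.0, 65.0, 120.0],  # three buckets for "age"
    [0.0, 50.0, 100.0],        # two buckets for "income"
]
binned = bin_columns(rows, columns, splits_list)
```

The point of the sketch is only that a single call walks the data once and handles every column, rather than one transformation stage per column.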



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
