[jira] [Commented] (SPARK-12026) ChiSqTest gets slower and slower over time when number of features is large

Apache Spark (JIRA) Fri, 04 Dec 2015 06:51:15 -0800

    [ https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041620#comment-15041620 ]


Apache Spark commented on SPARK-12026:
--------------------------------------

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/10146

> ChiSqTest gets slower and slower over time when number of features is large
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-12026
>                 URL: https://issues.apache.org/jira/browse/SPARK-12026
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.5.2
>            Reporter: Hunter Kelly
>              Labels: mllib, stats
>         Attachments: First Stages.png, Latest Stages.png
>
>
> I've been running a ChiSqTest to pick features for feature reduction.  My 
> understanding is that internally it creates jobs to run on batches of 1000 
> features at a time.
> I was under the impression that the features are treated as independent, but 
> this does not appear to be the case.  When the number of features is large 
> (160k in my case), each batch gets slower and slower.  As an example, running 
> on 25 m3.2xlarge instances on Amazon EMR, it started at just over 1 minute 
> per batch.  By the end, each batch was taking over 30 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

