[ https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041620#comment-15041620 ]
Apache Spark commented on SPARK-12026:
--------------------------------------

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/10146

> ChiSqTest gets slower and slower over time when number of features is large
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-12026
>                 URL: https://issues.apache.org/jira/browse/SPARK-12026
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.5.2
>            Reporter: Hunter Kelly
>              Labels: mllib, stats
>         Attachments: First Stages.png, Latest Stages.png
>
>
> I've been running a ChiSqTest to pick features for feature reduction. My
> understanding is that internally it creates jobs to run on batches of 1000
> features at a time.
> I was under the impression that the features are treated as independent, but
> this does not appear to be the case. When the number of features is large
> (160k in my case), each batch gets slower and slower. As an example, running
> on 25 m3.2xlarge instances on Amazon EMR, it started at just over 1 minute
> per batch. By the end, batches were taking over 30 minutes each.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)