[ https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732302#comment-14732302 ]
Sean Owen commented on SPARK-10433:
-----------------------------------

Quite possible; would that have resulted in excessively large inputs to each stage? I was seeing that megabytes of input to the trees suddenly became gigabytes over many iterations. The number of records exploded for some reason. That itself doesn't seem like a problem of long lineage, but I might be missing the connection.

> Gradient boosted trees
> ----------------------
>
>                 Key: SPARK-10433
>                 URL: https://issues.apache.org/jira/browse/SPARK-10433
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.4.1, 1.5.0
>            Reporter: Sean Owen
>
> (Sorry to say I don't have any leads on a fix, but this was reported by three different people and I confirmed it at fairly close range, so I think it's legitimate.)
> This is probably best explained in the words from the mailing list thread at http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E . Matt Forbes says:
> {quote}
> I am training a boosted trees model on a couple million input samples (with around 300 features) and am noticing that the input size of each stage is increasing each iteration. For each new tree, the first step seems to be building the decision tree metadata, which does a .count() on the input data, so this is the step I've been using to track the input size changing. Here is what I'm seeing:
> {quote}
> {code}
> count at DecisionTreeMetadata.scala:111
> 1. Input Size / Records: 726.1 MB / 1295620
> 2. Input Size / Records: 106.9 GB / 64780816
> 3. Input Size / Records: 160.3 GB / 97171224
> 4. Input Size / Records: 214.8 GB / 129680959
> 5. Input Size / Records: 268.5 GB / 162533424
> ....
> Input Size / Records: 1912.6 GB / 1382017686
> ....
> {code}
> {quote}
> This step goes from taking less than 10s up to 5 minutes by the 15th or so iteration. I'm not quite sure what could be causing this.
> I am passing a memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train
> {quote}
> Johannes Bauer showed me a very similar problem.
> Peter Rudenko offers this sketch of a reproduction:
> {code}
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.setNumIterations(30)
> boostingStrategy.setLearningRate(1.0)
> boostingStrategy.treeStrategy.setMaxDepth(3)
> boostingStrategy.treeStrategy.setMaxBins(128)
> boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
> boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
> boostingStrategy.treeStrategy.setUseNodeIdCache(true)
> boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
>   mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]])
> val model = GradientBoostedTrees.train(instances, boostingStrategy)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)