[ https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732302#comment-14732302 ]

Sean Owen commented on SPARK-10433:
-----------------------------------

Quite possible; would that have resulted in excessively large inputs to each 
stage? I was seeing megabytes of input to the trees suddenly become gigabytes 
over many iterations. The number of records exploded for some reason. That by 
itself doesn't seem like a long-lineage problem, but I may be missing the 
connection.
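
For what it's worth, here is a minimal sketch (not the GBT internals; the names 
and checkpoint path are made up) of how lineage accumulates in an iterative RDD 
loop and how checkpointing truncates it. Lineage alone shouldn't multiply 
records, but it does make each stage's DAG longer every iteration:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone sketch: each pass derives a new RDD from the
// previous one, so by iteration N the DAG holds all N prior map steps.
val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch"))
sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

var data = sc.parallelize(1 to 1000000).map(_.toDouble)
for (i <- 1 to 30) {
  data = data.map(_ * 0.9) // stand-in for a per-iteration residual update
  if (i % 10 == 0) {
    data.cache()
    data.checkpoint() // truncates lineage; the next action materializes it
  }
  data.count() // analogous to the count in DecisionTreeMetadata
}
{code}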

> Gradient boosted trees
> ----------------------
>
>                 Key: SPARK-10433
>                 URL: https://issues.apache.org/jira/browse/SPARK-10433
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.4.1, 1.5.0
>            Reporter: Sean Owen
>
> (Sorry to say I don't have any leads on a fix, but this was reported by three 
> different people and I confirmed it at fairly close range, so I think it's 
> legitimate.)
> This is probably best explained in the words of the mailing list thread at 
> http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E
> where Matt Forbes says:
> {quote}
> I am training a boosted trees model on a couple million input samples (with 
> around 300 features) and am noticing that the input size of each stage is 
> increasing each iteration. For each new tree, the first step seems to be 
> building the decision tree metadata, which does a .count() on the input data, 
> so this is the step I've been using to track the input size changing. Here is 
> what I'm seeing: 
> {quote}
> {code}
> count at DecisionTreeMetadata.scala:111 
> 1. Input Size / Records: 726.1 MB / 1295620 
> 2. Input Size / Records: 106.9 GB / 64780816 
> 3. Input Size / Records: 160.3 GB / 97171224 
> 4. Input Size / Records: 214.8 GB / 129680959 
> 5. Input Size / Records: 268.5 GB / 162533424 
> .... 
> Input Size / Records: 1912.6 GB / 1382017686 
> .... 
> {code}
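> (Note the record counts above: after an initial jump to about 50x the 
> original 1295620 records, each iteration adds about 25x more (75x, 100x, 
> 125x, ...), so the growth looks like repeated duplication of the input 
> rather than a constant overhead.)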
> {quote}
> This step goes from taking less than 10s to 5 minutes by the 15th or so 
> iteration. I'm not quite sure what could be causing this. I am passing a 
> memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train.
> {quote}
> Johannes Bauer showed me a very similar problem.
> Peter Rudenko offers this sketch of a reproduction:
> {code}
> import scala.collection.JavaConversions.mapAsJavaMap
> import org.apache.spark.mllib.tree.GradientBoostedTrees
> import org.apache.spark.mllib.tree.configuration.BoostingStrategy
>
> // instances: RDD[LabeledPoint] and categoricalFeatures: Map[Int, Int]
> // are assumed to be defined elsewhere
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.setNumIterations(30)
> boostingStrategy.setLearningRate(1.0)
> boostingStrategy.treeStrategy.setMaxDepth(3)
> boostingStrategy.treeStrategy.setMaxBins(128)
> boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
> boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
> boostingStrategy.treeStrategy.setUseNodeIdCache(true)
> boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
>   mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]])
> val model = GradientBoostedTrees.train(instances, boostingStrategy)
> {code}
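> A workaround worth trying (an untested sketch; the checkpoint directory is 
> hypothetical): with useNodeIdCache enabled, the tree Strategy exposes 
> checkpointing, which should periodically truncate the lineage of the 
> internal RDDs built during training:
> {code}
> sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path
> boostingStrategy.treeStrategy.setCheckpointInterval(10) // checkpoint every 10 iterations
> val model = GradientBoostedTrees.train(instances, boostingStrategy)
> {code}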


