[jira] [Resolved] (SPARK-10433) Gradient boosted trees: increasing input size in 1.4

Joseph K. Bradley (JIRA) Mon, 21 Mar 2016 15:22:34 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joseph K. Bradley resolved SPARK-10433.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0

I'm closing this since it seems to have been fixed in 1.5, but please say if it 
has occurred again after that.

> Gradient boosted trees: increasing input size in 1.4
> ----------------------------------------------------
>
>                 Key: SPARK-10433
>                 URL: https://issues.apache.org/jira/browse/SPARK-10433
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.4.1
>            Reporter: Sean Owen
>             Fix For: 1.5.0
>
>
> (Sorry to say I don't have any leads on a fix, but this was reported by three 
> different people and I confirmed it at fairly close range, so I think it's 
> legitimate:)
> This is probably best explained in the words from the mailing list thread at 
> http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E
> where Matt Forbes says:
> {quote}
> I am training a boosted trees model on a couple million input samples (with 
> around 300 features) and am noticing that the input size of each stage is 
> increasing each iteration. For each new tree, the first step seems to be 
> building the decision tree metadata, which does a .count() on the input data, 
> so this is the step I've been using to track the input size changing. Here is 
> what I'm seeing: 
> {quote}
> {code}
> count at DecisionTreeMetadata.scala:111 
> 1. Input Size / Records: 726.1 MB / 1295620 
> 2. Input Size / Records: 106.9 GB / 64780816 
> 3. Input Size / Records: 160.3 GB / 97171224 
> 4. Input Size / Records: 214.8 GB / 129680959 
> 5. Input Size / Records: 268.5 GB / 162533424 
> .... 
> Input Size / Records: 1912.6 GB / 1382017686 
> .... 
> {code}
> {quote}
> This step goes from taking less than 10s up to 5 minutes by the 15th or so 
> iteration. I'm not quite sure what could be causing this. I am passing a 
> memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train.
> {quote}
> Johannes Bauer showed me a very similar problem.
> Peter Rudenko offers this sketch of a reproduction:
> {code}
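> import scala.collection.JavaConversions.mapAsJavaMap
> import org.apache.spark.mllib.tree.GradientBoostedTrees
> import org.apache.spark.mllib.tree.configuration.BoostingStrategy
>
> // Assumed to be defined elsewhere (not shown in the original sketch):
> //   instances: RDD[LabeledPoint], the cached training data
> //   categoricalFeatures: Map[Int, Int], feature index -> number of categories
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.setNumIterations(30)
> boostingStrategy.setLearningRate(1.0)
> boostingStrategy.treeStrategy.setMaxDepth(3)
> boostingStrategy.treeStrategy.setMaxBins(128)
> boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
> boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
> boostingStrategy.treeStrategy.setUseNodeIdCache(true)
> // Convert the Scala map to the Java map type that the setter expects.
> boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
>   mapAsJavaMap(categoricalFeatures)
>     .asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]])
> val model = GradientBoostedTrees.train(instances, boostingStrategy)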
> {code}
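
The record counts quoted in the description can be checked with a little arithmetic. The following is a minimal sketch in plain Scala (no Spark needed) that uses only the five numbered counts from the "count at DecisionTreeMetadata.scala:111" log. The object and method names (InputGrowth, ratiosToFirst, increments) are made up for illustration and are not part of MLlib.

```scala
// Record counts copied from the log in the issue description (iterations 1 to 5).
object InputGrowth {
  val records: Vector[Long] =
    Vector(1295620L, 64780816L, 97171224L, 129680959L, 162533424L)

  // Size of each iteration's input relative to the first iteration.
  def ratiosToFirst(counts: Seq[Long]): Seq[Double] =
    counts.map(_.toDouble / counts.head)

  // Extra records read in each iteration compared with the previous one.
  def increments(counts: Seq[Long]): Seq[Long] =
    counts.zip(counts.tail).map { case (prev, next) => next - prev }

  def main(args: Array[String]): Unit = {
    ratiosToFirst(records).zipWithIndex.foreach { case (r, i) =>
      println(f"iteration ${i + 1}: $r%.1f x the first input")
    }
    increments(records).foreach { d =>
      println(f"increment: $d records (${d.toDouble / records.head}%.1f x the first input)")
    }
  }
}
```

On these figures the second count is about 50 times the first, and each later count adds roughly another 25 times the first, so the reported input grows approximately linearly with the iteration number rather than staying constant. This only summarizes the reported numbers; it does not identify the cause.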



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)