[ https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley resolved SPARK-10433.
---------------------------------------
Resolution: Fixed
Fix Version/s: 1.5.0
I'm closing this since it seems to have been fixed in 1.5, but please say if it
has occurred again after that.
> Gradient boosted trees: increasing input size in 1.4
> ----------------------------------------------------
>
> Key: SPARK-10433
> URL: https://issues.apache.org/jira/browse/SPARK-10433
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.4.1
> Reporter: Sean Owen
> Fix For: 1.5.0
>
>
> (Sorry to say I don't have any leads on a fix, but this was reported by three
> different people and I confirmed it at fairly close range, so I think it's
> legitimate.)
> This is probably best explained in the words from the mailing list thread at
> http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E .
> Matt Forbes says:
> {quote}
> I am training a boosted trees model on a couple million input samples (with
> around 300 features) and am noticing that the input size of each stage is
> increasing each iteration. For each new tree, the first step seems to be
> building the decision tree metadata, which does a .count() on the input data,
> so this is the step I've been using to track the input size changing. Here is
> what I'm seeing:
> {quote}
> {code}
> count at DecisionTreeMetadata.scala:111
> 1. Input Size / Records: 726.1 MB / 1295620
> 2. Input Size / Records: 106.9 GB / 64780816
> 3. Input Size / Records: 160.3 GB / 97171224
> 4. Input Size / Records: 214.8 GB / 129680959
> 5. Input Size / Records: 268.5 GB / 162533424
> ....
> Input Size / Records: 1912.6 GB / 1382017686
> ....
> {code}
> {quote}
> This step goes from taking less than 10s up to 5 minutes by the 15th or so
> iteration. I'm not quite sure what could be causing this. I am passing a
> memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train
> {quote}
> Johannes Bauer showed me a very similar problem.
> Peter Rudenko offers this sketch of a reproduction:
> {code}
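> import scala.collection.JavaConversions.mapAsJavaMap
> import org.apache.spark.mllib.tree.GradientBoostedTrees
> import org.apache.spark.mllib.tree.configuration.BoostingStrategy
> // Assumes `instances` (an RDD[LabeledPoint]) and `categoricalFeatures`
> // (a Map[Int, Int]) are already defined; the original sketch omits them.
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")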
> boostingStrategy.setNumIterations(30)
> boostingStrategy.setLearningRate(1.0)
> boostingStrategy.treeStrategy.setMaxDepth(3)
> boostingStrategy.treeStrategy.setMaxBins(128)
> boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
> boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
> boostingStrategy.treeStrategy.setUseNodeIdCache(true)
> boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
>   mapAsJavaMap(categoricalFeatures)
>     .asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]])
> val model = GradientBoostedTrees.train(instances, boostingStrategy)
> {code}
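As a side calculation on the record counts quoted in the description (arithmetic only, not a diagnosis of the root cause), the small Scala sketch below computes how much each `count` stage grew relative to the first one. The object name `InputGrowth` is hypothetical; the five numbers are copied from the quoted stage listing, and the elided later stages are left out.

```scala
// Record counts copied from the quoted "count at DecisionTreeMetadata.scala:111" stages.
object InputGrowth {
  val records: Vector[Long] =
    Vector(1295620L, 64780816L, 97171224L, 129680959L, 162533424L)

  // Extra records read by each stage compared with the previous stage.
  val deltas: Vector[Long] =
    records.zip(records.tail).map { case (prev, next) => next - prev }

  // Each stage's record count as a multiple of the first stage's count.
  val ratios: Vector[Double] = records.map(_.toDouble / records.head)

  def main(args: Array[String]): Unit = {
    records.indices.foreach { i =>
      val delta = if (i == 0) "-" else deltas(i - 1).toString
      println(f"stage ${i + 1}: records=${records(i)}%d ratio=${ratios(i)}%.1f delta=$delta")
    }
  }
}
```

On these five figures the ratios come out at roughly 1, 50, 75, 100 and 125 times the first stage, and from the second stage on each iteration adds about 32 million records, which is about 25 times the first count. That matches the report of input that keeps growing with every iteration instead of staying constant, but it says nothing by itself about why.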