Thanks for the suggestion. That would be faster, but less accurate in most cases. It's generally better to use a new random sample on each iteration, based on the literature and results I've seen.

Joseph
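To make the trade-off concrete, here is a minimal sketch contrasting the two strategies (this is not Spark's actual implementation; DataPoint, TreeModel, and fitTree are placeholder names standing in for LabeledPoint, the tree model, and DecisionTreeRegressor.train):

import scala.util.Random

case class DataPoint(label: Double, features: Array[Double])
trait TreeModel { def predict(features: Array[Double]): Double }

// Placeholder for the real tree learner.
def fitTree(data: Seq[DataPoint]): TreeModel = ???

// Proposed approach: subsample once up front and reuse the same subset.
// Uses less memory, but every tree sees identical rows, so the trees'
// errors are correlated across iterations.
def boostSampleOnce(input: Seq[DataPoint], rate: Double,
                    numIterations: Int, seed: Long): Seq[TreeModel] = {
  val rng = new Random(seed)
  val fixedSample = input.filter(_ => rng.nextDouble() < rate)
  (0 until numIterations).map(_ => fitTree(fixedSample))
}

// Stochastic gradient boosting (Friedman, 2002): draw a fresh subsample
// for each tree. Each iteration sees different rows, which decorrelates
// the trees and typically improves accuracy.
def boostResample(input: Seq[DataPoint], rate: Double,
                  numIterations: Int, seed: Long): Seq[TreeModel] = {
  (0 until numIterations).map { m =>
    val rng = new Random(seed + m)  // mirrors setSeed(seed + m) in the quoted code
    val freshSample = input.filter(_ => rng.nextDouble() < rate)
    fitTree(freshSample)
  }
}

(The pseudo-residual update is omitted here to keep the focus on the sampling strategy.)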
On Fri, Nov 11, 2016 at 5:13 AM, WangJianfei <wangjianfe...@otcaix.iscas.ac.cn> wrote:

> When we train the model, we use the data with a subsamplingRate, so if
> the subsamplingRate < 1.0, we can do a sample first to reduce the memory
> usage. See the code below in GradientBoostedTrees.boost():
>
>   while (m < numIterations && !doneLearning) {
>     // Update data with pseudo-residuals (residual errors)
>     val data = predError.zip(input).map { case ((pred, _), point) =>
>       LabeledPoint(-loss.gradient(pred, point.label), point.features)
>     }
>
>     timer.start(s"building tree $m")
>     logDebug("###################################################")
>     logDebug("Gradient boosting tree iteration " + m)
>     logDebug("###################################################")
>     val dt = new DecisionTreeRegressor().setSeed(seed + m)
>     val model = dt.train(data, treeStrategy)
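For reference, the change proposed in the quoted message amounts to roughly the following sampling step before the boosting loop (a sketch of the idea, not a tested patch; RDD.sample and treeStrategy.subsamplingRate are standard Spark APIs):

// Sketch of the proposal: sample the input once before the loop and then
// boost over `sampled` instead of `input`.
val sampled =
  if (treeStrategy.subsamplingRate < 1.0) {
    input.sample(withReplacement = false, treeStrategy.subsamplingRate, seed)
  } else {
    input
  }

As Joseph notes above, this trades accuracy for memory, since every iteration would then reuse the same fixed subset rather than drawing a fresh one.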