predError.zip(input) gives us an RDD, so in principle we could just sample predError or input first. But if we did that, we could no longer use zip, because zip requires the same number of elements in each partition of both RDDs. Thank you!
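To make the zip constraint concrete, here is a minimal sketch in plain Scala collections, not Spark itself. The names `Partitioned`, `zipPartitions`, and `sample` are hypothetical stand-ins made up for illustration: `zipPartitions` only mimics the documented contract of `RDD.zip` (same number of partitions, same number of elements per partition), and `sample` mimics seeded sampling without replacement. It suggests one way to keep both properties discussed in this thread: zip first, then sample the zipped pairs, with a seed that changes on every iteration.

```scala
import scala.util.Random

object ZipThenSample {
  // Hypothetical stand-in for an RDD: an outer Seq of partitions.
  type Partitioned[A] = Seq[Seq[A]]

  // Mimics the contract of RDD.zip: both sides must have the same number of
  // partitions and the same number of elements in each partition.
  def zipPartitions[A, B](left: Partitioned[A], right: Partitioned[B]): Partitioned[(A, B)] = {
    require(left.length == right.length, "different number of partitions")
    left.zip(right).map { case (l, r) =>
      require(l.length == r.length, "different number of elements in a partition")
      l.zip(r)
    }
  }

  // Bernoulli sampling without replacement. Each partition gets its own
  // generator derived from the seed, so a given seed is reproducible.
  def sample[A](data: Partitioned[A], fraction: Double, seed: Long): Partitioned[A] =
    data.zipWithIndex.map { case (part, idx) =>
      val rng = new Random(seed + idx)
      part.filter(_ => rng.nextDouble() < fraction)
    }
}
```

Under these assumptions, iteration m would use something like `sample(zipPartitions(predError, input), subsamplingRate, seed + m)`. Sampling predError and input separately would leave the two sides with different per-partition counts, and `zipPartitions` would reject them. In Spark the analogous step would presumably be a call to `RDD.sample(withReplacement, fraction, seed)` on the zipped RDD; whether that actually lowers memory usage in `boost()` is not something this sketch can show.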

------------------ Original Message ------------------
From: "Joseph Bradley [via Apache Spark Developers List]" <ml-node+s1001551n19899...@n3.nabble.com>
Sent: Wednesday, November 16, 2016, 3:54 AM
To: "WangJianfei" <wangjianfe...@otcaix.iscas.ac.cn>
Subject: Re: Reduce the memory usage if we do same first inGradientBoostedTrees if subsamplingRate< 1.0

Thanks for the suggestion. That would be faster, but less accurate in most cases. It's generally better to use a new random sample on each iteration, based on literature and results I've seen.

Joseph

On Fri, Nov 11, 2016 at 5:13 AM, WangJianfei <[hidden email]> wrote:

When we train the model, we use the data with a subSampleRate, so if subSampleRate < 1.0, we can sample first to reduce the memory usage. See the code below in GradientBoostedTrees.boost():

    while (m < numIterations && !doneLearning) {
      // Update data with pseudo-residuals (residual error)
      val data = predError.zip(input).map { case ((pred, _), point) =>
        LabeledPoint(-loss.gradient(pred, point.label), point.features)
      }

      timer.start(s"building tree $m")
      logDebug("###################################################")
      logDebug("Gradient boosting tree iteration " + m)
      logDebug("###################################################")
      val dt = new DecisionTreeRegressor().setSeed(seed + m)
      val model = dt.train(data, treeStrategy)

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-same-first-in-GradientBoostedTrees-if-subsamplingRate-1-0-tp19826.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-same-first-inGradientBoostedTrees-if-subsamplingRate-1-0-tp19904.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.