Jianfei Wang created SPARK-18463:
------------------------------------
Summary: I think it's necessary to have an overloaded method of
sample
Key: SPARK-18463
URL: https://issues.apache.org/jira/browse/SPARK-18463
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Jianfei Wang
Currently, in this situation:
rdd3 = rdd1.zip(rdd2).sample()
If we could sample the two RDDs directly, for example with
sample(rdd1, rdd2), we could reduce memory usage.
There are some use cases in Spark MLlib, such as in GradientBoostedTrees:
while (m < numIterations && !doneLearning) {
  // Update data with pseudo-residuals (the residual error)
  val data = predError.zip(input).map { case ((pred, _), point) =>
    LabeledPoint(-loss.gradient(pred, point.label), point.features)
  }
  val dt = new DecisionTreeRegressor().setSeed(seed + m)
  val model = dt.train(data, treeStrategy)
When we use data to train the model, a sample is taken.
So we could implement a method sample(rdd1, rdd2) to reduce memory usage in
such cases.
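
A minimal sketch of what such a two-input sample could look like at the
partition level. The names PairSampling and samplePairs, and the fraction and
seed parameters, are hypothetical and not part of Spark. It assumes Bernoulli
sampling without replacement and that both inputs hold the same number of
elements in the same order:

```scala
import scala.util.Random

// Hypothetical helper, not an existing Spark API.
object PairSampling {
  // Walks two iterators in lockstep and keeps each pair with probability
  // `fraction`. Iterator.zip and Iterator.filter are both lazy, so pairs
  // that are not selected are never collected into an intermediate buffer.
  def samplePairs[A, B](
      left: Iterator[A],
      right: Iterator[B],
      fraction: Double,
      seed: Long): Iterator[(A, B)] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    // nextDouble() returns a value in [0.0, 1.0), so fraction = 1.0 keeps
    // every pair and fraction = 0.0 keeps none.
    left.zip(right).filter(_ => rng.nextDouble() < fraction)
  }
}
```

On RDDs this could be applied per partition, for example
rdd1.zipPartitions(rdd2) { (a, b) => PairSampling.samplePairs(a, b, fraction, seed) }.
A real implementation would probably also need to vary the seed per partition.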