Jianfei Wang created SPARK-18463:
------------------------------------

             Summary: I think it's necessary to have an overloaded method of sample
                 Key: SPARK-18463
                 URL: https://issues.apache.org/jira/browse/SPARK-18463
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
            Reporter: Jianfei Wang


Currently, sampling two paired RDDs requires zipping them first:
rdd3 = rdd1.zip(rdd2).sample()
If we could sample the two RDDs directly, for example with
sample(rdd1, rdd2), we could reduce memory usage.

There are some use cases in Spark MLlib, such as in GradientBoostedTrees:

    while (m < numIterations && !doneLearning) {
      // Update data with pseudo-residuals (the residual error)
      val data = predError.zip(input).map { case ((pred, _), point) =>
        LabeledPoint(-loss.gradient(pred, point.label), point.features)
      }
      val dt = new DecisionTreeRegressor().setSeed(seed + m)
      val model = dt.train(data, treeStrategy)

When data is used to train the model, it is sampled first.
So we could implement a method sample(rdd1, rdd2) to reduce memory usage in
such cases.
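
To make the proposal concrete, here is a minimal sketch on plain Scala
iterators, which is the form in which Spark hands partition data to user
functions. The name sampleZipped, its parameters, and the Bernoulli-style
draw are assumptions for illustration, not an existing Spark API. It walks
both inputs in lockstep and builds a pair only for the positions it keeps.

```scala
import scala.util.Random

object ZipSample {
  /** Hypothetical helper (not a Spark API): walks two iterators in lockstep
    * and builds a pair only for positions that pass a random draw. */
  def sampleZipped[A, B](
      it1: Iterator[A],
      it2: Iterator[B],
      fraction: Double,
      rng: Random): Iterator[(A, B)] = {
    require(fraction >= 0.0 && fraction <= 1.0,
      s"fraction must be in [0, 1], got $fraction")
    new Iterator[(A, B)] {
      // The next kept pair, if one has already been found.
      private var pending: Option[(A, B)] = None

      // Advance both inputs until a position is kept or either input ends.
      private def fill(): Unit =
        while (pending.isEmpty && it1.hasNext && it2.hasNext) {
          val a = it1.next()
          val b = it2.next()
          if (rng.nextDouble() < fraction) pending = Some((a, b))
        }

      override def hasNext: Boolean = { fill(); pending.isDefined }

      override def next(): (A, B) = {
        fill()
        val pair = pending.getOrElse(
          throw new NoSuchElementException("next on empty iterator"))
        pending = None
        pair
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val preds = Iterator(0.1, 0.2, 0.3, 0.4)
    val labels = Iterator("a", "b", "c", "d")
    // Fixed seed so repeated runs keep the same positions.
    sampleZipped(preds, labels, fraction = 0.5, rng = new Random(42L))
      .foreach(println)
  }
}
```

In Spark, a function of this shape could be passed to
rdd1.zipPartitions(rdd2), which exposes each pair of partitions as two
iterators. Unlike RDD.zip, this sketch stops silently at the shorter input
instead of failing on unequal lengths, and a real implementation would need
a distinct seed per partition.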



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

