[ 
https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670521#comment-15670521
 ] 

Jianfei Wang commented on SPARK-18463:
--------------------------------------

Thank you very much. I had some misunderstanding about this case.

> I think it's necessary to have an overridden method of sample
> -------------------------------------------------------------
>
>                 Key: SPARK-18463
>                 URL: https://issues.apache.org/jira/browse/SPARK-18463
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Jianfei Wang
>
> Currently, in a situation such as
> rdd3 = rdd1.zip(rdd2).sample()
> the two RDDs are zipped first and the result is then sampled. If we could
> sample the two RDDs directly, for example with sample(rdd1, rdd2), we could
> reduce the memory usage.
> There are some use cases in Spark MLlib, such as in GradientBoostedTrees:
>     while (m < numIterations && !doneLearning) {
>       // Update data with pseudo-residuals (residual error)
>       val data = predError.zip(input).map { case ((pred, _), point) =>
>         LabeledPoint(-loss.gradient(pred, point.label), point.features)
>       }
>       val dt = new DecisionTreeRegressor().setSeed(seed + m)
>       val model = dt.train(data, treeStrategy)
> When we use data to train the model, a sample is taken.
> So we could implement a method sample(rdd1, rdd2) to reduce the memory usage
> in such cases.
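
A minimal sketch of the idea in the description, written against plain Scala iterators rather than the Spark API. The object ZipSampleSketch, the helper sampleZipped, and its combine parameter are hypothetical names used only for illustration; nothing like them is confirmed to exist in Spark. The sketch draws one Bernoulli decision per position and builds the combined element only for positions that are kept.

```scala
import scala.util.Random

object ZipSampleSketch {
  // Hypothetical helper, not part of Spark: walk two iterators in lockstep,
  // keep each position with probability `fraction`, and call `combine` only
  // for the positions that are kept.
  def sampleZipped[A, B, C](left: Iterator[A], right: Iterator[B],
                            fraction: Double, seed: Long)
                           (combine: (A, B) => C): Iterator[C] = {
    val rng = new Random(seed)
    Iterator
      .continually(())
      .takeWhile(_ => left.hasNext && right.hasNext)
      .flatMap { _ =>
        val a = left.next()
        val b = right.next()
        // One random draw per position, in order.
        if (rng.nextDouble() < fraction) Iterator.single(combine(a, b))
        else Iterator.empty
      }
  }
}
```

Because the helper is lazy and makes one draw per position in order, it should select the same positions as zipping first and then filtering with an identically seeded generator; only the point at which the combined element is built differs.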



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

