[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212932#comment-16212932 ]

Michael Mior commented on SPARK-19007:
--------------------------------------

I see the following statement from the PR discussion, but I don't 
understand why it causes a problem.

bq. it had to do with the fact that RDDs may be materialized later than 
checkpointer.update() gets called.
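
If I'm reading it right, the hazard is lazy evaluation: update() can unpersist 
(or clean up) an older RDD before the newer one has been materialized by an 
action, so the newer RDD's first action has to recompute the old lineage. A 
minimal local-mode sketch of that interaction (illustrative names, not the 
actual PeriodicRDDCheckpointer code):

{code:scala}
import org.apache.spark.sql.SparkSession

object LazyUnpersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext

    val prev = sc.parallelize(1 to 100000).map(_ * 2.0).persist()
    prev.count() // action: prev is computed and cached

    val next = prev.map(_ + 1.0).persist() // lazy: nothing computed or cached yet

    // An eager unpersist here (before next is materialized) means the
    // action below must recompute prev from its full lineage first.
    prev.unpersist()

    next.count() // rebuilds prev, then computes and caches next
    spark.stop()
  }
}
{code}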

> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
> ------------------------------------------------------------------------
>
>                 Key: SPARK-19007
>                 URL: https://issues.apache.org/jira/browse/SPARK-19007
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.0.1, 2.0.2, 2.1.0
>         Environment: A CDH cluster consisting of 3 Red Hat servers (120G 
> memory, 40 cores, 43TB disk per server).
>            Reporter: zhangdenghui
>            Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Test data: 80G of CTR training data from Criteo Labs 
> (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I 
> used 1 of the 24 days' data. Some features needed to be replaced by newly 
> generated continuous features; the method for generating them follows the 
> approach described in the XGBoost paper.
> Resources allocated: Spark on YARN, 20 executors, 8G memory and 2 cores per 
> executor.
> Parameters: numIterations 10, maxDepth 8; the remaining parameters are left 
> at their defaults.
> I tested the GradientBoostedTrees algorithm in MLlib using the 80G CTR data 
> mentioned above.
> It took 1.5 hours in total, and I saw many task failures after 6 or 7 GBT 
> rounds. Without those task failures and retries it could be much faster, 
> saving about half the time. I think the cause is the RDD named predError in 
> the while loop of the boost method in GradientBoostedTrees.scala: the lineage 
> of predError grows after every GBT round, which eventually causes failures 
> like this:
> (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) 
> Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 
> GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead.)
> I tried increasing spark.yarn.executor.memoryOverhead, but the memory needed 
> is too much (even increasing the memory by half doesn't solve the problem), 
> so I don't think that is the proper fix.
> The checkpoint interval for predError could also be set smaller to cut the 
> lineage, but that increases IO cost a lot.
> I tried another way to solve this problem: persist the predError RDD every 
> round, keep a reference to the previous round's RDD in pre_predError, and 
> unpersist it once it is no longer needed (see the sketch after this 
> description).
> Finally, with my method it took about 45 minutes, no task failures occurred, 
> and no extra memory was needed.
> So when the data is much larger than memory, this small improvement can 
> speed up GradientBoostedTrees by 1~2x.
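
A minimal sketch of the persist-and-unpersist pattern described above, as I 
understand it (local-mode stand-in code with illustrative update logic, not 
the actual boost() implementation):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Illustrative stand-in for the per-round persist/unpersist pattern.
object PersistPerRoundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext
    val numIterations = 10

    // Stand-in for the (prediction, error) pairs carried across boosting rounds.
    var predError = sc.parallelize(Seq.fill(100000)((0.0, 0.0)))
      .persist(StorageLevel.MEMORY_AND_DISK)
    predError.count() // materialize the initial round

    var m = 0
    while (m < numIterations) {
      val pre_predError = predError
      // Stand-in for re-scoring the data with the tree fitted in round m.
      predError = pre_predError
        .map { case (pred, err) => (pred + 0.1, err * 0.9) }
        .persist(StorageLevel.MEMORY_AND_DISK)
      predError.count()          // materialize while the previous copy is still cached
      pre_predError.unpersist()  // the previous round's RDD is no longer needed
      m += 1
    }

    spark.stop()
  }
}
{code}

The ordering is the point: count() runs before pre_predError.unpersist(), so 
each round reads the cached previous result instead of replaying the whole 
lineage.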



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
