[https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212932#comment-16212932]
Michael Mior commented on SPARK-19007:
--------------------------------------
I see the following statement in the PR discussion, but I don't understand
why this causes a problem.
bq. it had to do with the fact that RDDs may be materialized later than
checkpointer.update() gets called.
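For context, a minimal standalone sketch of the ordering that statement
describes (hypothetical code, not the actual PeriodicRDDCheckpointer; the
point is only that persist()/checkpoint() are lazy):
{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

def demo(sc: SparkContext): Unit = {
  sc.setCheckpointDir("/tmp/ckpt")
  val predError = sc.parallelize(1 to 1000).map(_.toDouble)
  val updated = predError.map(_ * 2.0)
  // Roughly what checkpointer.update() arranges for the new RDD;
  // both calls are lazy and do nothing by themselves.
  updated.persist(StorageLevel.MEMORY_AND_DISK)
  updated.checkpoint()
  // The RDD is only materialized (and actually written to the checkpoint
  // dir) when an action runs, which may be long after update() returned.
  println(updated.sum())
}
{code}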
> Speed up and optimize GradientBoostedTrees in the "data > memory" scenario
> ---------------------------------------------------------------------------
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster consisting of 3 Red Hat servers (120 GB
> memory, 40 cores, 43 TB disk per server).
> Reporter: zhangdenghui
> Priority: Minor
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Test data: 80 GB of CTR training data from criteolabs
> (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/);
> I used 1 of the 24 days' data. Some features needed to be replaced by newly
> generated continuous features; the new features are generated following the
> approach described in the xgboost paper.
> Resources allocated: Spark on YARN, 20 executors, 8 GB memory and 2 cores
> per executor.
> Parameters: numIterations 10, maxDepth 8; the remaining parameters are
> defaults.
> I tested the GradientBoostedTrees algorithm in mllib using the 80 GB of
> CTR data mentioned above.
> It took 1.5 hours in total, and I saw many task failures after 6 or 7 GBT
> rounds. Without these task failures and retries it could be much faster,
> saving about half the time. I think the failures are caused by the RDD
> named predError in the while loop of the boost method in
> GradientBoostedTrees.scala: the lineage of predError grows after every GBT
> round, which eventually causes failures like this:
> (ExecutorLostFailure (executor 6 exited caused by one of the running tasks)
> Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10
> GB physical memory used. Consider boosting
> spark.yarn.executor.memoryOverhead.)
> I tried boosting spark.yarn.executor.memoryOverhead, but the memory needed
> is too much (even increasing it by half does not solve the problem), so I
> don't think that is a proper fix.
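> For reference, a minimal sketch of the setting in question (the values are
> illustrative, matching the run above):
> {code:scala}
> import org.apache.spark.{SparkConf, SparkContext}
>
> val conf = new SparkConf()
>   .setAppName("GBTTraining")
>   .set("spark.executor.memory", "8g")
>   .set("spark.executor.cores", "2")
>   // Raise the off-heap overhead YARN accounts for; illustrative value.
>   .set("spark.yarn.executor.memoryOverhead", "4096")
> val sc = new SparkContext(conf)
> {code}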
> Setting the checkpoint interval smaller also cuts the lineage, but it
> increases the IO cost a lot; see the sketch below.
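> A minimal sketch of that knob, assuming the mllib API (sc and trainingData
> are assumed to be in scope; trainingData is an RDD[LabeledPoint]):
> {code:scala}
> import org.apache.spark.mllib.tree.GradientBoostedTrees
> import org.apache.spark.mllib.tree.configuration.BoostingStrategy
>
> sc.setCheckpointDir("hdfs:///tmp/gbt-checkpoints") // needed for checkpointing
>
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.numIterations = 10
> boostingStrategy.treeStrategy.maxDepth = 8
> // Checkpoint more often to cut the predError lineage (the default is 10),
> // at the price of writing the RDD out every checkpointInterval rounds.
> boostingStrategy.treeStrategy.checkpointInterval = 2
>
> val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
> {code}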
> I tried another way to solve this problem: persist the RDD named predError
> every round, keep a reference to the previous round's RDD in pre_predError,
> and unpersist it because it is useless afterwards. A sketch of the idea
> follows.
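> A minimal standalone sketch of the idea (hypothetical code, not the actual
> boost loop; step stands in for the per-round predError update):
> {code:scala}
> import org.apache.spark.rdd.RDD
> import org.apache.spark.storage.StorageLevel
>
> def boostWithUnpersist(
>     initialPredError: RDD[(Double, Double)],
>     numIterations: Int,
>     step: (RDD[(Double, Double)], Int) => RDD[(Double, Double)])
>   : RDD[(Double, Double)] = {
>   var predError = initialPredError
>   predError.persist(StorageLevel.MEMORY_AND_DISK)
>   var m = 0
>   while (m < numIterations) {
>     val prePredError = predError              // keep last round's RDD
>     predError = step(prePredError, m)         // build this round's lineage
>     predError.persist(StorageLevel.MEMORY_AND_DISK)
>     predError.count()                         // force materialization now
>     prePredError.unpersist()                  // old RDD is useless; free it
>     m += 1
>   }
>   predError
> }
> {code}
> Forcing materialization before the unpersist matters: if the old RDD were
> unpersisted first, computing the new predError could fall back to the full
> lineage, which is exactly the cost this change avoids.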
> With my method it takes about 45 minutes, with no task failures and no
> extra memory added.
> So when the data is much larger than memory, this small improvement can
> speed up GradientBoostedTrees by 1-2x.