[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154051#comment-14154051 ]
Manish Amde commented on SPARK-1547:
------------------------------------

Adding Hirakendu's feedback on checkpointing below.

~~~~~~~~~~~~~

Hi Manish,

Just looked at the JIRA; quite impressive progress. I will create a JIRA account and submit the Apache CLA, having procrastinated on that for a long time :).

About checkpointing, as I may have discussed before, I had a very crude solution to the problem of long lineage chains. I simply cache the residual dataset periodically, after a certain number of iterations, and uncache the previously cached dataset prior to caching the current one. This does materialize the dataset, but apparently doesn't break the lineage graph. Nonetheless, in practice I see the speed improve to what it would be if I had started afresh. It's a crude form of checkpointing, but it would be good to do the real thing.

About the choice of interval: obviously it depends on the dataset. Suppose there is a linear increase in time with iterations (from the map operation that subtracts the previous tree's predictions), and say each such map operation adds time t. If we break the chain every B iterations, then the total time for those map operations is t + 2t + ... + Bt ~= B^2 * t. Suppose the time taken to checkpoint/materialize is c; then the time for a period of B iterations is B^2 * t + c, so the time per iteration is (B^2 * t + c)/B = B*t + c/B. Minimizing this over B (set the derivative t - c/B^2 to zero) gives B = sqrt(c/t). (This ignores the -1s and factors of 2 in B^2 * t.)

Of course, t and c depend on the dataset. So, setting aside all calculations, in my implementation B is a user-defined parameter. In practice, I run the job once, watch how much time each iteration takes, and note when it slows down to the point that an iteration is as slow as if the job had started afresh. Note that, mathematically, at the optimal B = sqrt(c/t), the map operations over one period take total time B^2 * t = c, which is the same as the checkpoint time.
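The interval calculation above can be sketched numerically. A minimal model (the timing values t and c below are purely illustrative assumptions, not measurements from the original discussion) that compares the amortized per-iteration cost B*t + c/B against the analytic optimum B = sqrt(c/t):

```python
import math

def per_iteration_cost(B, t, c):
    """Amortized cost per iteration when the lineage is broken every B
    iterations: map work t + 2t + ... + Bt ~= B^2 * t, plus the
    checkpoint/materialization cost c, spread over B iterations."""
    return (B * B * t + c) / B  # = B*t + c/B

def optimal_interval(t, c):
    """Minimizing B*t + c/B over B (derivative t - c/B^2 = 0) gives sqrt(c/t)."""
    return math.sqrt(c / t)

# Hypothetical timings: 0.1 s of extra map work per iteration, 40 s to checkpoint.
t, c = 0.1, 40.0
B_opt = optimal_interval(t, c)
print(B_opt, per_iteration_cost(B_opt, t, c))

# Brute-force check that no integer interval beats the analytic optimum.
best_B = min(range(1, 200), key=lambda B: per_iteration_cost(B, t, c))
print(best_B)
```

With these illustrative numbers the optimum works out to B = 20, and the map work accumulated over one period, B^2 * t = 40 s, indeed equals the checkpoint cost c, matching the balance noted above.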
Clearly, this can be automated by keeping track of iteration times: just subtract the actual computation time and leave some room for noisy running times. Nonetheless, it would be good to have a user-defined override via an optional parameter. I think a similar optional parameter could also be provided to override the automatically determined batch size for deep trees.

Btw, the comment in the JIRA about support for sparse vectors is something nice to have and something I have been thinking about. Sparse datasets are now far too common and flexible, although the go-to solution for them is linear models.

Thanks,
Hirakendu.

> Add gradient boosting algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-1547
>                 URL: https://issues.apache.org/jira/browse/SPARK-1547
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Manish Amde
>            Assignee: Manish Amde
>
> This task requires adding the gradient boosting algorithm to Spark MLlib. The
> implementation needs to adapt the gradient boosting algorithm to the scalable
> tree implementation.
> The tasks involve:
> - Comparing the various tradeoffs and finalizing the algorithm before
>   implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation
> [Ensembles design document (Google doc) |
> https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/]