[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154051#comment-14154051 ]

Manish Amde commented on SPARK-1547:
------------------------------------

Adding Hirakendu's feedback on checkpointing below.

~~~~~~~~~~~~~

Hi Manish,

Just looked at the JIRA; quite impressive progress. I will create a JIRA account 
and submit the Apache CLA, which I have been procrastinating on for a long time :).

About checkpointing, as I may have discussed before, I had a very crude 
solution to the problem with long lineage chains. I simply cache the residual 
dataset periodically after a certain number of iterations. I uncache the 
previous cached dataset prior to caching the current one. It does materialize 
the dataset, but apparently doesn't break the lineage graph. Nonetheless, in 
practice I see that the speed recovers to what it would be if I had started 
afresh. It's a crude form of checkpointing, but it would be good to do the real 
thing.
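
To make the crude approach concrete, here is a rough Scala sketch, assuming 
Spark's RDD API; the object and method names, and the fitTree / 
subtractPredictions placeholders, are illustrative and not the actual 
implementation:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    object PeriodicCacheSketch {
      // Crude checkpointing as described above: every cacheInterval iterations,
      // persist and materialize the residual RDD, then release the previously
      // cached copy. fitTree and subtractPredictions stand in for the real
      // tree-fitting and residual-update steps.
      def boost[M](
          data: RDD[LabeledPoint],
          numIterations: Int,
          cacheInterval: Int,
          fitTree: RDD[LabeledPoint] => M,
          subtractPredictions: (RDD[LabeledPoint], M) => RDD[LabeledPoint]): Seq[M] = {
        var residuals = data
        var previouslyCached: Option[RDD[LabeledPoint]] = None
        val trees = Seq.newBuilder[M]

        for (i <- 1 to numIterations) {
          val tree = fitTree(residuals)
          trees += tree
          residuals = subtractPredictions(residuals, tree)

          if (i % cacheInterval == 0) {
            residuals.persist(StorageLevel.MEMORY_AND_DISK)
            residuals.count()                        // force materialization now
            previouslyCached.foreach(_.unpersist())  // drop the previous cached copy
            previouslyCached = Some(residuals)
          }
        }
        trees.result()
      }
    }

Calling count() right after persist() is what forces the materialization and 
recovers the iteration speed, even though the lineage itself is not truncated; 
RDD.checkpoint() would be the "real thing" that also breaks the lineage.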

About the choice of interval, obviously it depends on the dataset. Suppose 
there is a linear increase in time with iterations (from replaying the map 
operations that subtract the previous trees' predictions), and say each such 
map operation takes time t. If we break the chain every B iterations, then the 
time spent on these map operations over one period is t + 2t + ... + Bt 
~= B^2 * t. Suppose the time taken to checkpoint/materialize is c; then the 
time for a period of B iterations is B^2 * t + c, and the time per iteration 
is (B^2 * t + c)/B = B*t + c/B. Minimizing B*t + c/B over B gives 
B = sqrt(c/t). (Ignoring the -1s and factors of 2 in B^2 * t.)
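
Writing the same calculation out in full (in LaTeX), keeping the factor of 2 
that is ignored above, with t the per-map cost, c the materialization cost and 
B the interval:

    T(B) = t + 2t + \dots + Bt + c = \frac{B(B+1)}{2}\,t + c \approx \frac{B^2 t}{2} + c
    \frac{T(B)}{B} \approx \frac{B t}{2} + \frac{c}{B}
    \frac{d}{dB}\left(\frac{B t}{2} + \frac{c}{B}\right) = \frac{t}{2} - \frac{c}{B^2} = 0
        \implies B = \sqrt{2c/t}, \quad \text{i.e. } \sqrt{c/t} \text{ up to the dropped constant}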

Of course, t and c depend on the dataset. So setting aside all calculations, in 
my implementation B is a user-defined parameter. In practice, I run it once and 
watch how long each iteration takes; once an iteration slows down to the point 
where it takes about as long as the first one, I might as well have started 
afresh, so that is roughly where the chain should be broken. Note that 
mathematically, for the optimal B = sqrt(c/t), the extra map time accumulated 
over a period of B iterations is B^2 * t = c, which is the same as the 
checkpoint time.

Clearly, this can be automated by keeping track of iteration times: subtract 
the base computation time and leave some room for noisy running times. 
Nonetheless, it would be good to have a user-defined override via an optional 
parameter.
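
One possible shape for that automation, as a rough sketch (the class name, the 
noiseFactor padding and the maxInterval override are illustrative choices, not 
an existing API):

    /**
     * Remember how long the first iteration after a cache took, and signal a
     * new cache once the extra time per iteration reaches the observed
     * cache/materialization cost, padded a little to absorb noisy timings.
     * maxInterval is the user-defined override mentioned above.
     */
    class CacheIntervalHeuristic(
        cacheCostSeconds: Double,
        maxInterval: Int = 25,
        noiseFactor: Double = 1.2) {

      private var baseline: Option[Double] = None
      private var sinceLastCache = 0

      /** Call once per iteration with its wall-clock time; true means "cache now". */
      def shouldCache(iterationSeconds: Double): Boolean = {
        sinceLastCache += 1
        val base = baseline.getOrElse {
          baseline = Some(iterationSeconds)
          iterationSeconds
        }
        val overhead = iterationSeconds - base
        if (overhead >= noiseFactor * cacheCostSeconds || sinceLastCache >= maxInterval) {
          baseline = None       // the next iteration re-establishes the baseline
          sinceLastCache = 0
          true
        } else {
          false
        }
      }
    }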

I think a similar optional parameter may also be provided to override the 
automatically determined batch size for deep trees.

Btw, the comment in the JIRA about support for sparse vectors is a nice-to-have 
and something I have been thinking about. Sparse datasets are now all too 
common, although the go-to solution for them is linear models.

Thanks,
Hirakendu.

> Add gradient boosting algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-1547
>                 URL: https://issues.apache.org/jira/browse/SPARK-1547
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Manish Amde
>            Assignee: Manish Amde
>
> This task requires adding the gradient boosting algorithm to Spark MLlib. The 
> implementation needs to adapt the gradient boosting algorithm to the scalable 
> tree implementation.
> The task involves:
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation
> [Ensembles design document (Google doc) | 
> https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/]


