[ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061064#comment-15061064
 ] 

Joseph K. Bradley commented on SPARK-12272:
-------------------------------------------

First comment: I'd check the number of partitions and the Spark UI to make sure 
workers are doing equal amounts of work.
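
Besides the Spark UI, one rough way to check balance is to compare record 
counts per partition. The following is only a sketch: the helper name 
`partition_skew` and the sample counts are hypothetical, and it assumes the 
counts were already collected (in PySpark, for example, with 
`rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()`).

```python
def partition_skew(counts):
    """Return (max/mean ratio, number of empty partitions) for a list of
    per-partition record counts. A ratio near 1.0 means balanced work."""
    if not counts:
        raise ValueError("no partitions")
    mean = sum(counts) / len(counts)
    empty = sum(1 for c in counts if c == 0)
    ratio = max(counts) / mean if mean > 0 else float("inf")
    return ratio, empty


# Hypothetical counts; in practice these would come from the RDD itself.
counts = [1000, 1020, 980, 4000]
ratio, empty = partition_skew(counts)
print(f"{len(counts)} partitions, max/mean = {ratio:.2f}, empty = {empty}")
```

A ratio well above 1, or many empty partitions, suggests repartitioning the 
training data before calling the trainer.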

Second comment: MLlib follows the PLANET implementation, so it will have 
trouble with that many features.  There is ongoing work to overcome that issue: 
[SPARK-3717]; I hope to push that work into Spark within a couple of months.
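
To illustrate why the feature count matters, here is a back-of-the-envelope 
estimate of the split statistics a PLANET-style trainer aggregates in one 
pass. It is a sketch under assumptions, not a description of MLlib's actual 
memory use: it assumes one histogram per (node, feature, bin), three 
double-precision statistics per bin for a regression tree (count, sum, sum of 
squares), 32 bins, and hypothetical feature counts. The function name 
`histogram_bytes` is made up for this example.

```python
def histogram_bytes(num_nodes, num_features, num_bins=32,
                    stats_per_bin=3, bytes_per_stat=8):
    """Estimated size of the aggregated split statistics, in bytes.

    Assumes one histogram per (node, feature, bin); all defaults are
    illustrative assumptions.
    """
    return num_nodes * num_features * num_bins * stats_per_bin * bytes_per_stat


# Hypothetical feature counts, for a single tree node.
for num_features in (100, 10_000, 1_000_000):
    mib = histogram_bytes(1, num_features) / 2**20
    print(f"{num_features:>9} features: {mib:10.2f} MiB per node")
```

The estimate grows linearly in the number of features and in the number of 
nodes trained together, and statistics of this kind have to be combined across 
workers when searching for the best splits.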

Third comment: My understanding of xgboost is that it trains each tree on a 
single worker, using only the subset of the data held on that one worker.  
This differs from other implementations, which train each tree on all of the 
data.  As a result, xgboost does not have to communicate much data, but its 
individual trees cannot be as accurate; it's a trade-off.  There is a JIRA for 
exploring xgboost on Spark: [SPARK-8547].

I hope these two linked JIRAs will address your needs!

> Gradient boosted trees: too slow at the first finding best splits
> -----------------------------------------------------------------
>
>                 Key: SPARK-12272
>                 URL: https://issues.apache.org/jira/browse/SPARK-12272
>             Project: Spark
>          Issue Type: Request
>          Components: MLlib
>    Affects Versions: 1.5.2
>            Reporter: Wenmin Wu
>         Attachments: training-log1.png, training-log2.pnd.png, 
> training-log3.png
>
>




