[GitHub] spark pull request #14922: [WIP][SPARK-17175][ML][MLLib] Add a expert formul...

WeichenXu123 Fri, 02 Sep 2016 07:27:37 -0700

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14922#discussion_r77354917
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala ---
    @@ -405,5 +405,9 @@ private[ml] trait HasAggregationDepth extends Params {
     
       /** @group expertGetParam */
       final def getAggregationDepth: Int = $(aggregationDepth)
    +
    +  def getAggregationDepthByFormula(driverMemory: Long, dimension: Int, partitionNum: Int): Int = {
    +    if (dimension > 200) 2 else 1
    --- End diff --
    
    `RDD.treeAggregate` aggregates in levels (its `depth` argument).
    When level == 1, each RDD partition aggregates its elements sequentially,
    and the driver pulls every partition's result directly and then merges
    these intermediate results sequentially.
    When level == 2, each RDD partition aggregates sequentially (producing
    intermediate result 1), and the per-partition results are then shuffled to
    several reducers (level 1). These reducers merge them (producing
    intermediate result 2), and finally the driver pulls all of intermediate
    result 2 and performs the last aggregation.
    
    So the proper aggregation level (`aggregationDepth`) depends on the
    driver's memory setting, the dimension of the problem, and the number of
    partitions. For example, if the dimension is large and there are many
    partitions, level == 1 may cause a driver-side out-of-memory error and
    poor performance.
    But if the dimension is small, level == 1 is enough, and a level > 1 only
    adds overhead.
    So it is worth working out a proper formula: given the driver's memory
    setting, the dimension of the problem, and the number of partitions, how
    do we determine the `aggregationDepth` value that gives the best
    performance?
    I'm busy these days and need several days to run some detailed tests and
    update this PR.
    Thanks!



