Github user daniel-siegmann-aol commented on the pull request:

    https://github.com/apache/spark/pull/12761#issuecomment-215561740
  
    I'll give the results of my own training flow too. Testing was done on EMR 
4.4.0 with Spark 1.6.0. The cluster was configured with six r3.8xlarge nodes: 
one master, two core, three task.
    
    Here's my original configuration. Note that the driver needs a lot of memory, since the feature vector size is ~20 million. Also note that only 10 executors are used, because with 100+ executors the training time would increase to over 10 hours.
    
    ```
        --conf spark.driver.memory=128g
        --conf spark.driver.maxResultSize=112g
        --conf spark.executor.instances=10
        --conf spark.executor.memory=16g
        --conf spark.yarn.executor.memoryOverhead=2048
        --conf spark.default.parallelism=10
        --conf spark.sql.shuffle.partitions=10
    ```
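    As a rough sketch of the kind of training call these settings drive (the `trainingData` DataFrame, iteration count, and regularization value below are illustrative placeholders, not the actual job parameters):

    ```scala
    import org.apache.spark.ml.classification.LogisticRegression

    // Hypothetical training step: trainingData is a DataFrame with a "label" column
    // and a sparse "features" column of ~20 million dimensions.
    val lr = new LogisticRegression()
      .setMaxIter(100)     // illustrative iteration count
      .setRegParam(0.01)   // illustrative regularization strength
    val model = lr.fit(trainingData)
    ```
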
    For these numbers, the first aggregation is done in the 
`MultivariateOnlineSummarizer`, while the subsequent aggregations are done in 
the `LogisticAggregator` (a private class in `LogisticRegression`).
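
    Roughly, that first pass has the shape below; this is a sketch of the pattern, not the exact internal code, and `summarizeFeatures` is a hypothetical helper name.

    ```scala
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
    import org.apache.spark.rdd.RDD

    // Fold each partition's feature vectors into a summarizer, then merge the
    // per-partition summarizers tree-style back to the driver.
    def summarizeFeatures(features: RDD[Vector]): MultivariateOnlineSummarizer =
      features.treeAggregate(new MultivariateOnlineSummarizer)(
        seqOp = (summarizer, v) => summarizer.add(v),
        combOp = (s1, s2) => s1.merge(s2))
    ```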
    
    **Original**
    Total time: 1.3 hours
    First agg. time: 13 minutes
    Other agg. time: 2.1 - 2.5 minutes
    
    **With Sparse Aggregation**
    Total time: 10 minutes
    First agg. time: 31 seconds
    Other agg. time: 4 - 10 seconds
    
    Additionally, with sparse aggregation, increasing to 100 executors did not increase (or decrease) execution time. Memory could also be reduced; I tested with 8 GB on the driver with no problem.


