Github user daniel-siegmann-aol commented on the pull request:
https://github.com/apache/spark/pull/12761#issuecomment-215561740
I'll give the results of my own training flow too. Testing was done on EMR
4.4.0 with Spark 1.6.0. The cluster was configured with six r3.8xlarge nodes:
one master, two core, three task.
Here's my original configuration. Note that the driver needs a lot of memory,
since the feature vector size is ~20 million. Also note that only 10 executors
were used, because with 100+ executors the training time increased to over
10 hours.
```
--conf spark.driver.memory=128g
--conf spark.driver.maxResultSize=112g
--conf spark.executor.instances=10
--conf spark.executor.memory=16g
--conf spark.yarn.executor.memoryOverhead=2048
--conf spark.default.parallelism=10
--conf spark.sql.shuffle.partitions=10
```
For these numbers, the first aggregation is done in the
`MultivariateOnlineSummarizer`, while the subsequent aggregations are done in
the `LogisticAggregator` (a private class in `LogisticRegression`).
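To illustrate why sparse aggregation helps so much at ~20 million dimensions, here is a minimal sketch (plain Python, not Spark's actual internals; the function names are illustrative). With dense accumulation every record costs O(dim) regardless of how many features are actually set, while sparse accumulation only touches the nonzero entries, so per-record cost drops to O(nnz):

```python
# Hedged sketch of dense vs. sparse gradient-sum accumulation.
# These helpers are illustrative, not Spark's LogisticAggregator code.

def add_dense(acc, values):
    # Dense: touches every dimension, including zeros -- O(dim) per record.
    for i, v in enumerate(values):
        acc[i] += v
    return acc

def add_sparse(acc, indices, values):
    # Sparse: touches only the nonzero entries -- O(nnz) per record.
    for i, v in zip(indices, values):
        acc[i] += v
    return acc

dim = 10  # stands in for ~20 million in the real workload
acc = [0.0] * dim
# A record with only two nonzero features updates just two slots.
add_sparse(acc, [2, 7], [1.5, -0.5])
```

With millions of dimensions but few nonzeros per record, this difference compounds across every record of every aggregation pass, which is consistent with the aggregation times dropping from minutes to seconds below.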
**Original**
- Total time: 1.3 hours
- First agg. time: 13 minutes
- Other agg. time: 2.1 - 2.5 minutes

**With Sparse Aggregation**
- Total time: 10 minutes
- First agg. time: 31 seconds
- Other agg. time: 4 - 10 seconds
Additionally, with sparse aggregation, increasing to 100 executors neither
increased nor decreased execution time. Driver memory could also be reduced - I
tested with 8 GB on the driver with no problem.