Advitya17 opened a new pull request #513:
URL: https://github.com/apache/madlib/pull/513
JIRA: MADLIB-{1447,1448,1449}
This PR adds AutoML capabilities to Apache MADlib through a new function,
`madlib_keras_automl`, which combines setting up and running model selection
in a single call and helps automate and accelerate model selection and
training end-to-end. The user declaratively specifies the names of the
train/val datasets, the mst and output tables, the model architecture and
param grid details, the chosen method name and its associated params, and
various training details. Our API then handles scheduling and execution and
displays the algorithm workload info to the user.
The first AutoML algorithm we implement is Hyperband, a hyperparameter
optimization algorithm that speeds up random search with adaptive resource
allocation, successive halving (SHA) and early stopping. The algorithm
generates a schedule from the user inputs and evaluates model configurations
more efficiently than plain random search: poorly performing configurations
are stopped early, and the freed resources go to the more promising ones.
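To make the successive halving step concrete, here is a minimal sketch of a single SHA bracket in plain Python. It is an illustration only, not the code in this PR: the `evaluate` callback, the `toy_loss` function and the learning-rate configs are hypothetical stand-ins for real model training.

```python
import math


def successive_halving(configs, evaluate, min_resource, eta=3):
    """Run one SHA bracket: score all survivors, keep the best 1/eta, repeat.

    `evaluate(config, resource)` is a hypothetical callback that returns a
    validation loss (lower is better) after training for `resource` iterations.
    Returns the winning config and a list of (num_configs, resource) per round.
    """
    if not configs:
        raise ValueError("configs must not be empty")
    survivors = list(configs)
    resource = min_resource
    history = []
    while True:
        ranked = sorted(survivors, key=lambda c: evaluate(c, resource))
        history.append((len(ranked), resource))
        keep = len(ranked) // eta
        if keep < 1:
            # Nothing left to halve: the best-ranked config wins the bracket.
            return ranked[0], history
        survivors = ranked[:keep]   # early-stop everything outside the top 1/eta
        resource *= eta             # survivors get eta times more resources


def toy_loss(learning_rate, resource):
    """Hypothetical loss: best near lr = 1e-3, improving with more resources."""
    return abs(math.log10(learning_rate) + 3) + 1.0 / resource


configs = [1e-1, 3e-2, 1e-2, 3e-3, 1e-3, 3e-4, 1e-4, 3e-5, 1e-5]
best, history = successive_halving(configs, toy_loss, min_resource=1, eta=3)
print(best, history)
```

With 9 configs and eta = 3, the bracket trains 9 configs at the lowest resource level, then 3, then 1, tripling the resource each round. Hyperband runs several such brackets with different trade-offs between the number of configs and the starting resource.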
In the case of MPP databases such as Greenplum, we further accelerate this
algorithm by simultaneously evaluating multiple rounds of the algorithm located
along a 'diagonal', to keep machines busy and take advantage of the large
distributed storage and compute power offered by Greenplum.
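One way to picture the diagonal is sketched below. It assumes that a diagonal groups the SHA rounds that train for the same number of iterations, which follows from the standard Hyperband schedule where round `i` of bracket `s` gets `R * eta**(i - s)` resources. The function name `hyperband_diagonals` is hypothetical and not part of this PR.

```python
def hyperband_diagonals(s_max):
    """Group (bracket, round) pairs so that each group shares one resource level.

    Bracket s (s = s_max .. 0) has rounds i = 0 .. s. Round i of bracket s
    uses R * eta**(i - s) resources, so every pair with the same value of
    d = s_max - s + i trains for the same number of iterations.
    Assumption: this equal-resource grouping is what 'diagonal' refers to.
    """
    diagonals = []
    for d in range(s_max + 1):
        # Diagonal d touches brackets s_max - d .. s_max, one round from each.
        group = [(s_max - d + i, i) for i in range(d + 1)]
        diagonals.append(group)
    return diagonals


for d, group in enumerate(hyperband_diagonals(2)):
    print("diagonal", d, "->", group)
```

Under this assumption there are `s_max+1` diagonals, and all rounds in one diagonal can be trained together in a single multiple-model training call, which is what keeps the cluster busy.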
With the diagonal approach, we also introduce low-level optimizations in the
implementation that improve runtime and code quality by:
1. Reducing the number of random search function calls from `s_max+1` to just
`1`.
2. Reducing the number of multiple-model training function calls from
`s_max(s_max+1)/2` to `s_max+1`.
3. Reducing the number of sampled SHA configuration groups from `s_max+1` to
`s_max+1-skip_last` (i.e. only sampling the configurations actually needed for
evaluation).
Key:
R --> maximum amount of resources/iterations that can be allocated to a
single configuration in any particular round of Hyperband
eta --> factor controlling the proportion of configs discarded in each round
of SHA
s_max = floor(log(R)/log(eta)) --> controls the number of SHA brackets
(=s_max+1) executed with Hyperband
skip_last --> number of diagonals to skip at the end (to avoid running the
most time/resource-intensive bracket(s) and/or to avoid overfitting or loss of
predictive power). skip_last ∈ [0, s_max]
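As a worked example of the key above, the sketch below computes `s_max` and a bracket schedule from `R`, `eta` and `skip_last`. It uses the bracket-size formula from the original Hyperband paper and assumes integer `R` and `eta`; the implementation in this PR may round the number of configs differently. The function names `compute_s_max` and `hyperband_schedule` are hypothetical.

```python
def compute_s_max(R, eta):
    """Largest integer s with eta**s <= R, i.e. floor(log(R)/log(eta)).

    Integer arithmetic avoids floating-point rounding in the log ratio.
    """
    if eta <= 1 or R < 1:
        raise ValueError("need eta > 1 and R >= 1")
    s = 0
    while eta ** (s + 1) <= R:
        s += 1
    return s


def hyperband_schedule(R, eta, skip_last=0):
    """Return {bracket s: [(n_i, r_i), ...]} for the brackets that still run.

    n_i is the number of configs and r_i the resources per config in round i.
    Assumption: skipping a diagonal drops the last round of every bracket,
    so brackets with s < skip_last disappear entirely.
    """
    s_max = compute_s_max(R, eta)
    if not 0 <= skip_last <= s_max:
        raise ValueError("skip_last must lie in [0, s_max]")
    schedule = {}
    for s in range(s_max, skip_last - 1, -1):
        # Paper formula: n = ceil((s_max + 1) * eta**s / (s + 1)), in integers.
        n = ((s_max + 1) * eta ** s + s) // (s + 1)
        rounds = []
        for i in range(s + 1 - skip_last):
            n_i = n // eta ** i              # configs surviving to round i
            r_i = R * eta ** i / eta ** s    # resources per config in round i
            rounds.append((n_i, r_i))
        schedule[s] = rounds
    return schedule


for s, rounds in hyperband_schedule(R=81, eta=3).items():
    print("bracket", s, rounds)
```

For `R=81` and `eta=3` this gives `s_max=4`, so 5 brackets run with `skip_last=0`. Setting `skip_last=1` leaves 4 brackets, matching the `s_max+1-skip_last` sampled configuration groups, and removes the round that trains for the full `R` iterations.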