[
https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16474602#comment-16474602
]
Matthias Boehm commented on SYSTEMML-2299:
------------------------------------------
[~Guobao] Initially, I would leave it up to the user by allowing the
specification via a parameter such as {{mode}} (maybe rename the existing mode
to utype?). This is similar to parfor, where a user can specify the execution
mode as {{LOCAL}}, {{REMOTE_MR}} or {{REMOTE_SPARK}}. While building the
runtime, let's make this parameter mandatory. Later, we can generalize this and
automatically decide the execution mode if it is not provided by the user. For
example, we could compute a cost estimate from the number of floating point
operations per batch, scaled by the number of epochs and the data size. If this
estimate exceeds a certain minimum cost threshold and all memory constraints are
met, we could automatically route the computation to distributed operations.
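A minimal sketch of the cost-based fallback described above, assuming the FLOP count per batch and the outcome of the memory checks are already available from the compiler. The names `estimate_cost`, `choose_mode` and `memory_ok`, as well as the 1e12 FLOP threshold, are hypothetical and not SystemML APIs. A user-specified mode always takes precedence, matching the proposal to keep the parameter mandatory for now.

```python
import math
from typing import Optional

# Hypothetical sketch: these names and the threshold are NOT SystemML APIs.
LOCAL = "LOCAL"
REMOTE_SPARK = "REMOTE_SPARK"


def estimate_cost(flops_per_batch: float, num_rows: int,
                  batch_size: int, epochs: int) -> float:
    """Total FLOPs = cost per batch x batches per epoch x number of epochs."""
    batches_per_epoch = math.ceil(num_rows / batch_size)
    return flops_per_batch * batches_per_epoch * epochs


def choose_mode(user_mode: Optional[str], flops_per_batch: float,
                num_rows: int, batch_size: int, epochs: int,
                memory_ok: bool, threshold: float = 1e12) -> str:
    """Return the user's mode if given; otherwise decide from the cost estimate."""
    if user_mode is not None:
        return user_mode
    cost = estimate_cost(flops_per_batch, num_rows, batch_size, epochs)
    # Go distributed only if the job is expensive enough AND the
    # (assumed, precomputed) memory constraints are satisfied.
    if cost > threshold and memory_ok:
        return REMOTE_SPARK
    return LOCAL
```

For example, with 2e6 FLOPs per batch, 1,000,000 rows, a batch size of 64 and 100 epochs, the estimate is 2e6 * 15625 * 100 = 3.125e12 FLOPs, which exceeds the assumed threshold, so `choose_mode(None, 2e6, 1_000_000, 64, 100, memory_ok=True)` returns `"REMOTE_SPARK"`.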
> API design of the paramserv function
> ------------------------------------
>
> Key: SYSTEMML-2299
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2299
> Project: SystemML
> Issue Type: Sub-task
> Reporter: LI Guobao
> Assignee: LI Guobao
> Priority: Major
>
> The objective of the “paramserv” built-in function is to update an initial or
> existing model according to a given configuration. An initial function
> signature would be:
> {code:java}
> model'=paramserv(model, features=X, labels=Y, val_features=X_val,
> val_labels=Y_val, upd="fun1", agg="fun2", mode="BSP", freq="BATCH",
> epochs=100, batchsize=64, k=7, scheme="disjoint_contiguous",
> hyperparams=params, checkpointing="NONE"){code}
> We are interested in providing the model (which will be a struct-like data
> structure consisting of the weights, the biases and the hyperparameters), the
> training features and labels, the validation features and labels, the batch
> update function (i.e., the gradient calculation function), the update strategy
> (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g.,
> per epoch or per mini-batch), the gradient aggregation function, the number of
> epochs, the batch size, the degree of parallelism, the data partitioning
> scheme, a list of additional hyperparameters, and the checkpointing strategy.
> The function returns the trained model in the same struct format.
> *Inputs*:
> * model <list>: a list consisting of the weight and bias matrices
> * features <matrix>: training features matrix
> * labels <matrix>: training label matrix
> * val_features <matrix>: validation features matrix
> * val_labels <matrix>: validation label matrix
> * upd <string>: the name of the gradient calculation function
> * agg <string>: the name of the gradient aggregation function
> * mode <string> (options: BSP, ASP, SSP): the update mode
> * freq <string> (options: EPOCH, BATCH): the frequency of updates
> * epochs <integer>: the number of epochs
> * batchsize <integer> [optional]: the batch size; ignored if the update
> frequency is "EPOCH"
> * k <integer>: the degree of parallelism
> * scheme <string> (options: disjoint_contiguous, disjoint_round_robin,
> disjoint_random, overlap_reshuffle): the data partitioning scheme, i.e., how
> the data is distributed across workers
> * hyperparams <list> [optional]: a list of additional hyperparameters, e.g.,
> learning rate, momentum
> * checkpointing <string> (options: NONE (default), EPOCH, EPOCH10)
> [optional]: the checkpointing strategy, i.e., a checkpoint after every epoch
> (EPOCH) or after every 10 epochs (EPOCH10)
> *Output*:
> * model' <list>: a list consisting of the updated weight and bias matrices
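To make the four values of the {{scheme}} argument concrete, here is a minimal sketch that assigns row indices to {{k}} workers. The semantics are assumed from the scheme names alone (in particular, that overlap_reshuffle gives every worker a full, independently shuffled copy of the data), and {{partition}} is a hypothetical helper, not part of SystemML.

```python
import random
from typing import List


def _contiguous_blocks(rows: List[int], k: int) -> List[List[int]]:
    """Split rows into k contiguous blocks whose sizes differ by at most one."""
    n = len(rows)
    bounds = [i * n // k for i in range(k + 1)]
    return [rows[bounds[i]:bounds[i + 1]] for i in range(k)]


def partition(num_rows: int, k: int, scheme: str, seed: int = 7) -> List[List[int]]:
    """Hypothetical helper: return, per worker, the row indices it trains on."""
    rows = list(range(num_rows))
    if scheme == "disjoint_contiguous":
        # Consecutive row ranges, one per worker.
        return _contiguous_blocks(rows, k)
    if scheme == "disjoint_round_robin":
        # Row i is assigned to worker i mod k.
        return [rows[w::k] for w in range(k)]
    rng = random.Random(seed)
    if scheme == "disjoint_random":
        # Shuffle once, then cut into disjoint contiguous blocks.
        rng.shuffle(rows)
        return _contiguous_blocks(rows, k)
    if scheme == "overlap_reshuffle":
        # Assumed semantics: every worker sees all rows, each in its own order.
        parts = []
        for _ in range(k):
            perm = rows[:]
            rng.shuffle(perm)
            parts.append(perm)
        return parts
    raise ValueError(f"unknown partition scheme: {scheme}")


if __name__ == "__main__":
    print(partition(10, 3, "disjoint_contiguous"))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
    print(partition(7, 3, "disjoint_round_robin"))  # [[0, 3, 6], [1, 4], [2, 5]]
```

The three disjoint schemes differ only in which rows land on which worker, so each row is visited once per epoch overall; overlap_reshuffle instead trades k times the data volume for every worker seeing the full dataset.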
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)