[ https://issues.apache.org/jira/browse/SYSTEMML-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mike Dusenberry updated SYSTEMML-493:
-------------------------------------
    Component/s: Algorithms

> Modularize Existing DML Algorithms
> ----------------------------------
>
>                 Key: SYSTEMML-493
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-493
>             Project: SystemML
>          Issue Type: Epic
>          Components: Algorithms
>            Reporter: Mike Dusenberry
>
> Currently, our provided DML algorithms come in the form of single, long
> scripts that contain the read and write statements, are usually not broken up
> into modular UDFs, and require the user to supply all arguments via the
> command line or bash scripts. As a high-level example:
> {code}
> // read statements, parameter parsing, etc.
> X = read(...)
> hyperparam1 = $1
> anotherHyperparam = $2
> ...
> // core part of the algorithm
> // note: this is not broken up into a UDF, and instead is just a continuation
> // of the script
> while (!converged) {
>   // do stuff
> }
> // outputs, test results, stats, etc.
> write(...)
> print(...)
> {code}
> The issue here is that many ML algorithms require hyperparameter tuning and
> are part of a general data flow (data ingestion, cleaning, splitting, etc.).
> Given this, it would be ideal if our algorithm scripts were modularized so
> that the core parts of the algorithms were wrapped in UDFs (e.g.
> {{train(...)}}, {{test(...)}}, etc.). Then, rather than having to perform
> these additional steps from a bash script, a user could instead import our
> algorithm scripts from DML and call the UDFs as necessary. As an
> example of the modification to our scripts:
> {code}
> // read statements, parameter parsing, etc.
> X = read(...)
> hyperparam1 = $1
> anotherHyperparam = $2
> ...
> // core part of the algorithm
> // note: this is wrapped in a UDF, thus allowing the user to import it and
> // supply arguments from another DML script if desired
> train = function (matrix[double] X, double hyperparam1, double hyperparam2)
>     return (matrix[double] model) {
>   while (!converged) {
>     // do stuff
>   }
> }
> // when run as a script, this will invoke the `train(...)` function, thus
> // achieving the same result as the previous script design
> model = train(X, hyperparam1, anotherHyperparam)
> // outputs, test results, stats, etc.
> write(...)
> print(...)
> {code}
> By modularizing the core parts of the algorithms into UDFs while still
> keeping the surrounding read/write statements, our provided scripts can be
> executed directly in the (currently) normal fashion, while also allowing them
> to be imported from other DML scripts for the use of the UDFs directly. As an
> example of a custom DML workflow script:
> {code}
> // import
> source("LinearReg.dml") as lr
> // ingest data
> X_dirty = read(...)
> // clean data
> X = ...
> // split
> X_train = ...
> X_val = ...
> X_test = ...
> // hyperparameter tuning
> while (tuning) {
>   hyperparam1 = ...
>   hyperparam2 = ...
>   model = lr::train(X_train, hyperparam1, hyperparam2)
>   error = lr::test(X_val, ...)
>   ...
> }
> // use best hyperparameters
> ...
> // save model
> write(model)
> {code}
> This change could be applied to all of our provided DML algorithms, and many
> could be broken up into {{train(...)}}, {{test(...)}}, {{stats(...)}}, etc.
> functions. The goal here is to promote the use of DML for the entire ML
> pipeline (i.e. the way Python, R, Scala, etc. are currently being used),
> rather than encouraging the use of cumbersome bash scripts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
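For readers more familiar with Python, the script/import dual-use pattern proposed above is analogous to Python's `if __name__ == "__main__"` idiom: the core logic lives in an importable function, while the entry-point block keeps the read/parse/write steps for direct command-line use. This is only an illustrative sketch of the pattern, not SystemML code; `train` here is a hypothetical placeholder, not a real algorithm from the repository.

```python
# Sketch of the analogous modularization pattern in Python.
# `train` is a toy placeholder for the "core part of the algorithm";
# other scripts can `import` this module and call train() directly,
# just as DML scripts would `source(...)` and call `lr::train(...)`.

def train(X, hyperparam1, hyperparam2, max_iters=100, tol=1e-6):
    """Toy iterative 'training' loop standing in for the core algorithm."""
    model = 0.0
    for _ in range(max_iters):
        update = hyperparam1 * (sum(X) / len(X)) - hyperparam2 * model
        if abs(update) < tol:  # "converged"
            break
        model += update
    return model

if __name__ == "__main__":
    # Script-style use: the surrounding read/parse/write steps stay here,
    # so the file still runs stand-alone from the command line.
    X = [1.0, 2.0, 3.0]  # stands in for X = read(...)
    model = train(X, hyperparam1=0.1, hyperparam2=0.5)
    print(model)  # stands in for write(...)
```

When run as a script, the guard fires and the placeholder pipeline executes; when imported, only `train` is exposed, mirroring how the modularized DML scripts could serve both roles.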