[ 
https://issues.apache.org/jira/browse/SYSTEMML-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Dusenberry updated SYSTEMML-493:
-------------------------------------
    Component/s: Algorithms

> Modularize Existing DML Algorithms
> ----------------------------------
>
>                 Key: SYSTEMML-493
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-493
>             Project: SystemML
>          Issue Type: Epic
>          Components: Algorithms
>            Reporter: Mike Dusenberry
>
> Currently, our provided DML algorithms come in the form of single, long 
> scripts that contain read and write statements, are usually not broken up 
> into modular UDFs, and require the user to supply all arguments via the 
> command line or bash scripts.  As a high-level example:
> {code}
> # read statements, parameter parsing, etc.
> X = read(...)
> hyperparam1 = $1
> anotherHyperparam = $2
> ...
> # core part of the algorithm
> # note: this is not broken up into a UDF, and instead is just a
> # continuation of the script
> while (!converged) {
>   # do stuff
> }
> # outputs, test results, stats, etc.
> write(...)
> print(...)
> {code}
> The issue here is that many ML algorithms require hyperparameter tuning and 
> are part of a larger data flow (data ingestion, cleaning, splitting, etc.).  
> Therefore, it would be ideal if our algorithm scripts were modularized so 
> that the core parts of the algorithms were wrapped in UDFs (e.g. 
> {{train(...)}} and {{test(...)}}).  Then, rather than having to perform 
> these additional steps from a bash script, a user could instead import our 
> algorithm scripts from DML and call the UDFs as necessary.  As an 
> example of the modification to our scripts:
> {code}
> # read statements, parameter parsing, etc.
> X = read(...)
> hyperparam1 = $1
> anotherHyperparam = $2
> ...
> # core part of the algorithm
> # note: this is wrapped in a UDF, thus allowing the user to import it and
> # supply arguments from another DML script if desired
> train = function (matrix[double] X, double hyperparam1, double hyperparam2)
>     return (matrix[double] model) {
>   while (!converged) {
>     # do stuff
>   }
> }
> # when run as a script, this will invoke the `train(...)` function, thus
> # achieving the same result as the previous script design
> model = train(X, hyperparam1, anotherHyperparam)
> # outputs, test results, stats, etc.
> write(...)
> print(...)
> {code}
> By modularizing the core parts of the algorithms into UDFs while still 
> keeping the surrounding read/write statements, our provided scripts can 
> still be executed as scripts in the (currently) normal fashion, while also 
> being importable from other DML scripts for direct use of the UDFs.  As an 
> example of a custom DML workflow script:
> {code}
> # import
> source("LinearReg.dml") as lr
> # ingest data
> X_dirty = read(...)
> # clean data
> X = ...
> # split
> X_train = ...
> X_val = ...
> X_test = ...
> # hyperparameter tuning
> while (tuning) {
>   hyperparam1 = ...
>   hyperparam2 = ...
>   model = lr::train(X_train, hyperparam1, hyperparam2)
>   error = lr::test(X_val, ...)
>   ...
> }
> # use best hyperparameters
> ...
> # save model
> write(model, ...)
> {code}
> This change could be applied to all of our provided DML algorithms, and many 
> could be broken up into {{train(...)}}, {{test(...)}}, {{stats(...)}}, etc. 
> functions.  The goal here is to promote the use of DML for the entire ML 
> pipeline (i.e., in the way that Python, R, Scala, etc. are used today), 
> rather than encouraging the use of cumbersome bash scripts.
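> As a rough sketch, a modular {{test(...)}} function for a linear model might 
> look like the following (the signature and the choice of mean squared error 
> here are illustrative assumptions, not a proposed API):
> {code}
> # hypothetical sketch: a test(...) UDF that scores a linear model by mean
> # squared error; the names and signature are illustrative only
> test = function (matrix[double] X, matrix[double] y, matrix[double] model)
>     return (double error) {
>   y_pred = X %*% model                    # predictions from learned weights
>   error = sum((y - y_pred)^2) / nrow(X)   # mean squared error
> }
> {code}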



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
