[ 
https://issues.apache.org/jira/browse/SYSTEMML-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Boehm updated SYSTEMML-1378:
-------------------------------------
    Fix Version/s:     (was: SystemML 1.0)
                   SystemML 0.14

> Native dataset support in parfor spark datapartition-execute
> ------------------------------------------------------------
>
>                 Key: SYSTEMML-1378
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1378
>             Project: SystemML
>          Issue Type: Sub-task
>          Components: APIs, Runtime
>            Reporter: Matthias Boehm
>            Assignee: Matthias Boehm
>             Fix For: SystemML 0.14
>
>
> This task aims for a deeper integration of Spark Datasets into SystemML. 
> Consider the following example scenario, invoked through MLContext with X 
> being a Dataset<Row>:
> {code}
> X = read(...)
> parfor( i in 1:nrow(X) ) {
>     Xi = X[i, ]
>     v[i, 1] = ... some computation over Xi
> }
> {code}
> Currently, we would convert the input dataset to binary block (1st shuffle) 
> at API level and subsequently pass it into SystemML. For large data, we would 
> then compile a single parfor data-partition execute job that slices row 
> fragments, collects row fragments into partitions (2nd shuffle), and finally 
> executes the parfor body per partition. 
> Native dataset support would allow us to avoid these two shuffles and compute 
> the entire parfor in a data-local manner. In detail, this involves the 
> following extensions:
> * API level: Keep lineage of input dataset leveraging our existing lineage 
> mechanism in {{MatrixObject}}
> * Parfor datapartition-execute: SYSTEMML-1367 already introduced 
> data-local processing for special cases (if ncol<=blocksize). Given the 
> lineage, we can simply probe the input to datapartition-execute and, for row 
> partitioning, use the dataset directly instead of the reblocked matrix RDD, 
> in a data-local manner. This avoids not only the 2nd shuffle but, due to lazy 
> evaluation, also the 1st shuffle if no operation other than parfor accesses X 
> (except for zipWithIndex if no ids are passed in, as this transformation 
> triggers computation)
> * Cleanup: Prevent cleanup (unpersist) of lineage objects of type dataset as 
> they are passed from outside.
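
The data-local idea described above can be illustrated with a small sketch. It is only a simulation in plain Python: nested lists stand in for the partitions of the input dataset, and no Spark or SystemML API is used. The function names (zip_with_index, parfor_data_local) and the sample data are hypothetical and chosen for illustration; they are not part of SystemML. The sketch assumes that row ids are not passed in, so they are derived from per-partition row counts, which mirrors why zipWithIndex needs a pass over the data before the parfor body can run.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Attach a global 0-based row id to every row.

    The per-partition counts must be known before any id can be assigned,
    which is the extra pass over the data mentioned in the description.
    """
    counts = [len(part) for part in partitions]
    # Offset of a partition = number of rows in all preceding partitions.
    offsets = [0] + list(accumulate(counts))[:-1]
    return [
        [(offset + k, row) for k, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


def parfor_data_local(partitions, body):
    """Run the parfor body once per row, inside the partition holding it.

    Rows are never regrouped across partitions, so neither the reblock
    shuffle nor the partitioning shuffle is simulated here.
    """
    results = {}
    for part in zip_with_index(partitions):
        for i, row in part:
            results[i + 1] = body(row)  # DML indices are 1-based
    return results


# Hypothetical input: 5 rows spread over 2 partitions.
X_partitions = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],
]

# "some computation over Xi" is a row sum in this sketch.
v = parfor_data_local(X_partitions, body=sum)
```

Here v maps each 1-based row index i to the result of the body for row Xi, matching the v[i, 1] assignment in the DML example. In the actual runtime, the inner loop would run per partition on the executors rather than sequentially on the driver.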



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
