[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
LI Guobao updated SYSTEMML-2418:
--------------------------------
Description: In the context of ML, it would be more efficient to support
data partitioning in a distributed manner. This task aims to do the data
partitioning on Spark: all the data is first split among the workers, each
worker then partitions its share according to the chosen scheme, and the
partitioned data, which stays on each worker, can be passed directly to
model training without materialization on HDFS.
(was: In the context of ML, it would be more efficient to support data
partitioning in a distributed manner. This task aims to do the data
partitioning on Spark: all the data is first split among the workers, each
worker then partitions its share according to the chosen scheme, and the
partitioned data, which stays on each worker, can be passed directly to
model training.)
> Spark data partitioner
> ----------------------
>
> Key: SYSTEMML-2418
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
> Project: SystemML
> Issue Type: Sub-task
> Reporter: LI Guobao
> Assignee: LI Guobao
> Priority: Major
>
> In the context of ML, it would be more efficient to support data
> partitioning in a distributed manner. This task aims to do the data
> partitioning on Spark: all the data is first split among the workers, each
> worker then partitions its share according to the chosen scheme, and the
> partitioned data, which stays on each worker, can be passed directly to
> model training without materialization on HDFS.
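
The flow described above could be sketched roughly as follows. This is a minimal single-process Python illustration, not SystemML or Spark code: the function names (`split_among_workers`, `partition_locally`, `train_on_partitions`), the contiguous-chunk split, and the round-robin scheme are all hypothetical and chosen only to show the three steps (split among workers, partition on the worker side, hand the in-memory partitions straight to training).

```python
from typing import Callable, Dict, List, Sequence

Row = Sequence[float]


def split_among_workers(rows: List[Row], num_workers: int) -> List[List[Row]]:
    """Step 1: distribute the input rows across workers in contiguous chunks."""
    chunk = -(-len(rows) // num_workers)  # ceiling division
    return [rows[i * chunk:(i + 1) * chunk] for i in range(num_workers)]


def round_robin(index: int, num_partitions: int) -> int:
    """Hypothetical scheme: the i-th local row goes to partition i mod k."""
    return index % num_partitions


def partition_locally(
    local_rows: List[Row],
    num_partitions: int,
    scheme: Callable[[int, int], int],
) -> Dict[int, List[Row]]:
    """Step 2: each worker partitions only its own rows according to the scheme."""
    parts: Dict[int, List[Row]] = {p: [] for p in range(num_partitions)}
    for i, row in enumerate(local_rows):
        parts[scheme(i, num_partitions)].append(row)
    return parts


def train_on_partitions(parts: Dict[int, List[Row]]) -> Dict[int, int]:
    """Step 3 (stand-in for training): consume the in-memory partitions
    directly, without writing them to a distributed file system first."""
    return {p: len(rows) for p, rows in parts.items()}


if __name__ == "__main__":
    data = [[float(i)] for i in range(10)]
    for worker_id, local in enumerate(split_among_workers(data, num_workers=3)):
        local_parts = partition_locally(local, num_partitions=2, scheme=round_robin)
        print(worker_id, train_on_partitions(local_parts))
```

In this sketch the partitioned rows never leave the worker that produced them, which is the point of skipping the intermediate materialization step; a real implementation on Spark would replace the plain lists with distributed datasets.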