[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2418:
--------------------------------
    Description: In the context of ML, it would be more efficient to support 
data partitioning in a distributed manner. This task aims to perform the data 
partitioning on Spark: the data is first split among the workers, each worker 
then partitions its split locally according to the chosen scheme, and the 
partitioned data, which stays on each worker, can be passed directly to model 
training.  (was: In the context of ML, the training data is usually too large 
to fit on the Spark driver node, so partitioning such a large dataset in CP is 
no longer feasible. This task aims to do the data partitioning in a 
distributed way: each worker receives its split of the training data and 
partitions it locally according to the different schemes. All the data is then 
grouped by the given key (i.e., the worker id) and finally written to separate 
HDFS files in the scratch space.)

> Spark data partitioner
> ----------------------
>
>                 Key: SYSTEMML-2418
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>
> In the context of ML, it would be more efficient to support data 
> partitioning in a distributed manner. This task aims to perform the data 
> partitioning on Spark: the data is first split among the workers, each 
> worker then partitions its split locally according to the chosen scheme, and 
> the partitioned data, which stays on each worker, can be passed directly to 
> model training.
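
The two-stage flow described above can be illustrated with a minimal sketch. 
This is a plain-Python simulation under stated assumptions, not SystemML or 
Spark code: the function names (split_among_workers, partition_locally) and 
the scheme names ("contiguous", "round_robin") are hypothetical and chosen 
only for illustration. Each list stands in for the data held by one worker.

```python
def split_among_workers(rows, num_workers):
    """Stage 1 (assumed): hand each worker a contiguous split of the rows."""
    chunk = -(-len(rows) // num_workers)  # ceiling division
    return [rows[i * chunk:(i + 1) * chunk] for i in range(num_workers)]


def partition_locally(local_rows, num_parts, scheme):
    """Stage 2 (assumed): a worker partitions its own split by a scheme."""
    if scheme == "contiguous":
        # Consecutive blocks of rows; block size is at least 1.
        size = -(-len(local_rows) // num_parts) or 1
        return [local_rows[i * size:(i + 1) * size] for i in range(num_parts)]
    if scheme == "round_robin":
        # Row j of the split goes to partition j % num_parts.
        return [local_rows[i::num_parts] for i in range(num_parts)]
    raise ValueError("unknown scheme: " + scheme)


def run(rows, num_workers, num_parts, scheme):
    """Split, partition on each worker, and keep the result per worker."""
    splits = split_among_workers(rows, num_workers)
    # The partitions stay with the worker that produced them, so a training
    # step could consume them without sending the data back to the driver.
    return [partition_locally(s, num_parts, scheme) for s in splits]


if __name__ == "__main__":
    for worker_id, parts in enumerate(run(list(range(10)), 3, 2, "round_robin")):
        print(worker_id, parts)
```

In this sketch no step collects the full data set in one place, which is the 
property the description asks for; how the real partitioner maps splits and 
schemes onto Spark operations is left open here.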



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)