[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525520#comment-16525520 ]

Matthias Boehm commented on SYSTEMML-2418:
------------------------------------------

You might want to rephrase the introductory sentence a bit. Fundamentally, we want 
both local and distributed data partitioners to support the general case of 
arbitrary data sizes (don't use "overfitting" because it can easily be confused): 
if the data fits into the driver, we can do local partitioning; otherwise, we use 
the distributed data partitioners. For this task, I would recommend focusing on 
distributed partitioning into k partitions that are the immediate input for the 
individual workers, without the need for materialization on HDFS. In pseudo code, 
it would look like {{data.flatMap(d -> 
partition(d)).groupByKey(k).foreach(p -> runWorker(p))}}. Later, we can 
optionally also allow materialization on HDFS. The scratch space is shared 
by all workers, but for worker-local intermediates and results, we create 
dedicated subdirectories.
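
A minimal sketch of this pipeline against Spark's Java API, using hypothetical 
partitionByScheme and runWorker helpers and a hash-based placeholder scheme (not 
the actual SystemML implementation):

{code:java}
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

import java.util.Collections;
import java.util.List;

public class ParamservPartitionSketch {

  public static void run(JavaRDD<double[]> data, int k) {
    // 1) Split each input row locally into (workerID, slice) pairs
    //    according to the chosen partitioning scheme.
    JavaPairRDD<Integer, double[]> partitioned = data.flatMapToPair(
        (double[] row) -> partitionByScheme(row, k).iterator());

    // 2) Group the slices by worker id into k partitions so that each
    //    worker receives exactly its partition, without materializing on HDFS.
    JavaPairRDD<Integer, Iterable<double[]>> byWorker = partitioned.groupByKey(k);

    // 3) Run each worker directly on its partition.
    byWorker.foreach((Tuple2<Integer, Iterable<double[]>> part) ->
        runWorker(part._1(), part._2()));
  }

  // Hypothetical: split one row into (workerID, slice) pairs per the scheme.
  private static List<Tuple2<Integer, double[]>> partitionByScheme(double[] row, int k) {
    int workerID = Math.floorMod(java.util.Arrays.hashCode(row), k); // placeholder scheme
    return Collections.singletonList(new Tuple2<>(workerID, row));
  }

  // Hypothetical: execute one paramserv worker on its local data partition.
  private static void runWorker(int workerID, Iterable<double[]> slices) {
    // ... worker-local training loop would go here ...
  }
}
{code}

Using groupByKey rather than reduceByKey reflects that each worker needs its 
complete partition as input, not an aggregate per key.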

> Spark data partitioner
> ----------------------
>
>                 Key: SYSTEMML-2418
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>
> In the context of ML, the training data will usually be "overfitted" in the 
> Spark driver node, i.e., too large to fit in driver memory. So partitioning 
> such enormous data is no longer feasible in CP. This task aims to do the data 
> partitioning in a distributed way, which means that each worker will receive 
> its split of the training data and do the data partitioning locally according 
> to different schemes. Then all the data will be grouped by the given key 
> (i.e., the worker id) and finally written into separate HDFS files in the 
> scratch space.
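
The description above also mentions writing each worker's grouped partition into 
the scratch space. A minimal sketch of such a per-worker layout under a shared 
scratch root, using the Hadoop FileSystem API with hypothetical path and helper 
names (again, not the actual SystemML code):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class ScratchSpaceLayoutSketch {

  // Shared scratch space with a dedicated subdirectory per worker for its
  // intermediates and results, e.g. <scratchRoot>/paramserv/<jobID>/worker_<workerID>/
  public static Path workerDir(String scratchRoot, long jobID, int workerID) {
    return new Path(scratchRoot, "paramserv/" + jobID + "/worker_" + workerID);
  }

  // Write one worker's serialized partition into its dedicated subdirectory.
  public static void writePartition(Configuration conf, Path dir, byte[] serializedBlock)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(dir);
    try (FSDataOutputStream out = fs.create(new Path(dir, "data"))) {
      out.write(serializedBlock);
    }
  }
}
{code}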



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
