[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525520#comment-16525520 ]
Matthias Boehm commented on SYSTEMML-2418:
------------------------------------------

You might want to rephrase the introductory sentence a bit. Fundamentally, we want both local and distributed data partitioners to support the general case of arbitrary data sizes (don't use "overfitting" because it can easily be confused): if the data fits into the driver, we can do local partitioning; otherwise we use distributed data partitioners.

For this task, I would recommend focusing on the distributed partitioning into k partitions that are the immediate input for the individual workers, without the need for materialization on HDFS. In pseudo code, it would look like {{data.flatMap(d -> partition(d)).reduceByKey(k).foreach(d -> runWorker(d))}} (see the runnable sketch after the issue details below). Later we can optionally also allow the materialization on HDFS. The scratch space is shared by all workers, but for worker-local intermediates and results, we create dedicated subdirectories.

> Spark data partitioner
> ----------------------
>
>                 Key: SYSTEMML-2418
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>
> In the context of ML, the training data will usually overfit (i.e., exceed the memory of) the Spark driver node, so partitioning such enormous data is not feasible in CP. This task aims to do the data partitioning in a distributed way: each worker receives its split of the training data and partitions it locally according to different schemes. All the data is then grouped by the given key (i.e., the worker id) and finally written to separate HDFS files in the scratch space.
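To make the pipeline in the comment concrete, here is a minimal, self-contained sketch in Java on the Spark RDD API. This is not the SystemML implementation: the row type, the round-robin {{assignWorker}} scheme, and {{runWorker}} are hypothetical stand-ins, and it uses {{groupByKey}} to collect each worker's partition as an alternative to merging partial partitions with {{reduceByKey}}.

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkDataPartitionerSketch {

  // Hypothetical scheme: assign each row to a worker round-robin by row index;
  // other schemes would emit a row for one or more workers here.
  static int assignWorker(long rowIndex, int numWorkers) {
    return (int) (rowIndex % numWorkers);
  }

  // Hypothetical stand-in for the per-worker computation (e.g., local training).
  static void runWorker(int workerId, Iterable<double[]> partition) {
    int rows = 0;
    for (double[] r : partition)
      rows++;
    System.out.println("worker " + workerId + " received " + rows + " rows");
  }

  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("PartitionerSketch").setMaster("local[2]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      final int numWorkers = 4;

      // Toy input: 100 rows of a feature matrix.
      List<double[]> rows = new ArrayList<>();
      for (int i = 0; i < 100; i++)
        rows.add(new double[] {i, 2.0 * i});
      JavaRDD<double[]> data = sc.parallelize(rows);

      // Step 1 (partition): tag every row with its target worker id.
      JavaPairRDD<Integer, double[]> tagged = data.zipWithIndex()
          .mapToPair(t -> new Tuple2<>(assignWorker(t._2(), numWorkers), t._1()));

      // Step 2 (group into k partitions): each worker's partition becomes the
      // immediate worker input - no materialization on HDFS.
      // Step 3 (run workers): execute the worker directly on its partition.
      tagged.groupByKey(numWorkers)
          .foreach(p -> runWorker(p._1(), p._2()));
    }
  }
}
{code}

If a partitioning scheme emits partial partitions per worker, merging them with {{reduceByKey}} and a combine function, as in the comment's pseudo code, may be preferable to {{groupByKey}}, since it combines map-side before the shuffle.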