[
https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525520#comment-16525520
]
Matthias Boehm commented on SYSTEMML-2418:
------------------------------------------
You might want to rephrase the introductory sentence a bit. Fundamentally, we want
both local and distributed data partitioners to support the general case of
arbitrary data sizes (don't use "overfitting" because it can easily be confused) -
if the data fits into the driver, we can do local partitioning; otherwise we use
distributed data partitioners. For this task, I would recommend focusing on
distributed partitioning into k partitions that are the immediate input to the
individual workers, without the need for materialization on HDFS. In pseudo
code, it would look like {{data.flatMap(d ->
partition(d)).reduceByKey(k).forEach(d -> runWorker(d))}}. Later we can
optionally also allow materialization on HDFS. The scratch space is shared
by all workers, but for worker-local intermediates and results, we create
dedicated subdirectories.
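To make that pipeline concrete, here is a minimal Spark (Java API) sketch; the double[] row representation, the hash-based worker assignment, and the per-worker counting loop are only placeholders for the actual partitioning schemes and runWorker logic, and groupByKey(k) stands in for the reduceByKey(k) step:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkDataPartitionerSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
      .setAppName("DataPartitionerSketch").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      final int k = 4; // number of workers, i.e., target partitions

      // Toy stand-in for the training data: one double[] per row.
      List<double[]> rows = new ArrayList<>();
      for (int i = 0; i < 100; i++)
        rows.add(new double[] {i, i * 0.5});
      JavaRDD<double[]> data = sc.parallelize(rows);

      data
        // partition(d): assign each row a worker id according to the chosen
        // scheme (placeholder: hash-based assignment for illustration)
        .flatMapToPair(row -> Collections.singletonList(
          new Tuple2<>(Math.floorMod(Arrays.hashCode(row), k), row)).iterator())
        // collect each worker's slice (groupByKey(k) as stand-in for reduceByKey(k))
        .groupByKey(k)
        // runWorker(d): each worker consumes its slice directly from the shuffle
        // output, without materializing it on HDFS first
        .foreach(slice -> {
          int count = 0;
          for (double[] ignored : slice._2())
            count++;
          System.out.println("worker " + slice._1() + " received " + count + " rows");
        });
    }
  }
}
{code}
In the real implementation this would operate on the internal matrix representation rather than raw double[] rows, but the overall flatMap/group/foreach shape stays the same.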
> Spark data partitioner
> ----------------------
>
> Key: SYSTEMML-2418
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
> Project: SystemML
> Issue Type: Sub-task
> Reporter: LI Guobao
> Assignee: LI Guobao
> Priority: Major
>
> In the context of ML, the training data will usually not fit into the Spark
> driver node, so partitioning such large data in CP is no longer feasible. This
> task aims to do the data partitioning in a distributed way, which means that
> each worker will receive its split of the training data and do the data
> partitioning locally according to different schemes. All the data will then
> be grouped by the given key (i.e., the worker id) and finally written into
> separate HDFS files in the scratch space.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)