[
https://issues.apache.org/jira/browse/MAPREDUCE-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817207#comment-13817207
]
Lijie Xu commented on MAPREDUCE-5605:
-------------------------------------
How to implement a MapReduce-like framework both effectively and efficiently is a
real challenge. I just want to know how you deal with the following issues, all of
which are relevant to an in-memory processing framework.
Multi-threads vs. multi-processes: Hadoop runs its tasks as separate processes,
while Apache Spark runs them as threads. Spark chose threads because it wants to
share data between tasks/jobs, and sharing data across processes (JVMs) is not
efficient. Based on the architecture of this proposal, it seems that you want to
gather the intermediate data of several mappers and reducers together in memory,
so that more control can be applied to optimize I/O. However, the concrete dataflow
is not given, so I want to know whether there is data sharing between tasks/jobs
and how large the shared data will be.
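To make the question concrete, below is a minimal sketch (all class and method
names are hypothetical, not taken from the patch) of the threads-in-one-JVM model,
where several map tasks publish their intermediate records into one shared,
node-wide pool:

{code:java}
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: tasks run as threads in one JVM and publish their
// intermediate records into a shared, globally managed in-memory pool.
public class ExecutionEngineSketch {
    // shared across all task threads on this node -- the "shared data" in question
    private final ConcurrentLinkedQueue<byte[]> sharedIntermediatePool =
            new ConcurrentLinkedQueue<byte[]>();

    private final ExecutorService taskPool = Executors.newFixedThreadPool(8);

    // each map task is just a Runnable; its output never leaves the JVM heap
    public Future<?> submitMapTask(final List<byte[]> split) {
        return taskPool.submit(new Runnable() {
            public void run() {
                for (byte[] record : split) {
                    sharedIntermediatePool.add(record); // shared, not per-process
                }
            }
        });
    }
}
{code}

In the process-per-task model each mapper would instead spill to its own local
files, so the question of how large the shared data grows does not arise there.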
Fault-tolerance: Compared with the multi-threaded approach, the multi-process
policy has its advantages: it is easy to manage and easy to make fault-tolerant.
Hadoop is disk-based and process-based, so the failure of a mapper/reducer can be
handled simply by rerunning the lost task on an appropriate node. If the engine
becomes memory-based, the safety of intermediate data (e.g., the outputs of a
mapper) is much harder to guarantee. Furthermore, crashes of individual threads,
or of the JVM itself, also deserve attention.
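As a sketch of the thread-crash concern (again hypothetical, not from the patch):
with one process per task, rescheduling the failed task elsewhere is enough, but
with threads sharing a JVM the engine must also notice the thread's death and
invalidate whatever that task had already pushed into shared memory, for example
via an uncaught-exception handler:

{code:java}
// Hypothetical sketch: detecting the death of a single task thread so that its
// memory-only output can be marked lost and recomputed, instead of leaving
// stale data in the shared pool. A crash of the whole JVM would lose the
// unspilled outputs of every task on the node at once.
public class ThreadFailureSketch {
    public static Thread newTaskThread(Runnable task, final String taskId) {
        Thread t = new Thread(task, "task-" + taskId);
        t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            public void uncaughtException(Thread thread, Throwable e) {
                // In disk-based Hadoop, rescheduling the task is sufficient;
                // here the engine must also discard the task's in-memory buffers.
                System.err.println("task " + taskId + " died: " + e);
            }
        });
        return t;
    }
}
{code}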
Idempotence: This means that rerunning any task will not affect the final result.
Putting the input/output/intermediate data of several tasks together requires
special management to preserve this property.
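For reference, the usual way to keep re-execution idempotent is the
attempt-directory-plus-atomic-commit pattern (as Hadoop's output committer does
for files); a minimal sketch follows, and a memory-pooled design would need an
equivalent commit step for in-memory outputs:

{code:java}
import java.io.File;
import java.io.IOException;

// Minimal sketch of idempotent task commit: every attempt writes into its own
// attempt directory, and only one successful attempt is renamed ("committed")
// to the final location; reruns that find a committed result become no-ops.
public class IdempotentCommitSketch {
    public static void commit(File attemptDir, File finalDir) throws IOException {
        if (finalDir.exists()) {
            // another attempt already committed; this rerun must not change the result
            deleteRecursively(attemptDir);
            return;
        }
        if (!attemptDir.renameTo(finalDir)) {
            throw new IOException("commit failed for " + attemptDir);
        }
    }

    private static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File c : children) {
                deleteRecursively(c);
            }
        }
        f.delete();
    }
}
{code}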
Trade-off between memory usage and disk usage: Since memory is limited and the data
is huge, we still need the disk to store/spill some of the data. When and how to
spill over-full in-memory data to disk is therefore an important issue, and it is
directly tied to performance.
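The spill decision I have in mind is something like the following sketch
(hypothetical names; analogous in spirit to io.sort.mb / io.sort.spill.percent in
today's map-side buffer, but applied to a node-wide memory budget): records
accumulate in memory and are written out sequentially once usage crosses a soft
threshold.

{code:java}
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the spill policy: buffer records in memory and, once a
// soft fraction of the memory budget is used, flush the buffer to disk in one
// sequential write, then continue filling the (now empty) buffer.
public class SpillBufferSketch {
    private final long budgetBytes;
    private final double spillThreshold;      // e.g. 0.8 of the budget
    private final List<byte[]> buffer = new ArrayList<byte[]>();
    private long usedBytes = 0;
    private int spillCount = 0;

    public SpillBufferSketch(long budgetBytes, double spillThreshold) {
        this.budgetBytes = budgetBytes;
        this.spillThreshold = spillThreshold;
    }

    public synchronized void add(byte[] record) throws IOException {
        buffer.add(record);
        usedBytes += record.length;
        if (usedBytes >= budgetBytes * spillThreshold) {
            spillToDisk();
        }
    }

    private void spillToDisk() throws IOException {
        DataOutputStream out = new DataOutputStream(
                new FileOutputStream("spill-" + (spillCount++) + ".bin"));
        try {
            for (byte[] record : buffer) {
                out.writeInt(record.length);
                out.write(record);
            }
        } finally {
            out.close();
        }
        buffer.clear();
        usedBytes = 0;
    }
}
{code}

The interesting questions are how the threshold is chosen when many tasks share one
budget, and whether spilling is overlapped with computation.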
> Memory-centric MapReduce aiming to solve the I/O bottleneck
> -----------------------------------------------------------
>
> Key: MAPREDUCE-5605
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5605
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 1.0.1
> Environment: x86-64 Linux/Unix
> 64-bit jdk7 preferred
> Reporter: Ming Chen
> Assignee: Ming Chen
> Fix For: 1.0.1
>
> Attachments: MAPREDUCE-5605-v1.patch,
> hadoop-core-1.0.1-mammoth-0.9.0.jar
>
>
> Memory is a very important resource for bridging the gap between CPUs and I/O
> devices, so the idea is to maximize the use of memory to relieve the I/O
> bottleneck. We developed a multi-threaded task execution engine, which runs in
> a single JVM on a node. In the execution engine, we implemented a memory
> scheduling algorithm to realize global memory management, on top of which we
> further developed techniques such as sequential disk access and multi-cache,
> and solved the problem of full garbage collection in the JVM. The benchmark
> results show that it achieves impressive improvements in typical cases. When a
> system is relatively short of memory (e.g., HPC, small- and medium-size
> enterprises), the improvement is even more impressive.
--
This message was sent by Atlassian JIRA
(v6.1#6144)