Hi
We are considering using MapReduce for a project. I am participating in
an "investigation" phase where we are trying to find out whether we
would benefit from using the MapReduce framework.
A little bit about the project:
We will be receiving data from the "outside world" in files via FTP. It
will be a mix of very small files (50 records/lines) and very big files
(5 million+ records/lines). The FTP server will be running in a DMZ where
we have no plans of using any Hadoop technology. For every file arriving
over FTP we will add a message (just pointing to that file) to an MQ also
running in the DMZ - how we do that is not relevant to my questions here.
In the secure zone of our system we plan to run many machines ("shards"
if you like) that will, among other things, be consumers on the MQ in
the DMZ. Their job will, among other things, be to "load" (store in a
database, index, etc.) the files pointed to by the messages they receive
from the MQ. For reasonably small files a consumer will probably just do
the "loading" of the entire file itself. For very big files we would
like to have more machines/shards participating in "loading" that
particular file than just the single machine/shard that happens to
receive the corresponding message.
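
To make that concrete, here is a rough sketch of what we imagine each
consumer doing. Everything in it (class name, method names, the
50000-record threshold) is made up purely to illustrate the intent:

// Rough sketch of one MQ consumer in the secure zone. All names and
// the threshold are hypothetical - just illustrating the intent.
public class LoadConsumer {

    private static final long SMALL_FILE_THRESHOLD = 50000; // records

    // Called for every message taken off the MQ; a message just points
    // at a file on the FTP server in the DMZ.
    public void onMessage(String filePath, long recordCount) {
        if (recordCount <= SMALL_FILE_THRESHOLD) {
            // Reasonably small file: this consumer "loads" it all itself.
            loadEntireFile(filePath);
        } else {
            // Very big file: we would like more machines/shards than just
            // this one to participate in "loading" it - this is where we
            // hope MapReduce comes in.
            distributeLoading(filePath);
        }
    }

    private void loadEntireFile(String path) { /* store in db, index, etc. */ }

    private void distributeLoading(String path) { /* the open question */ }
}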
Questions:
- In general, do you think MapReduce will be beneficial for us to use?
Please remember that the files to be "loaded" do not live on HDFS. Any
description of why you would suggest that we use MapReduce will be very
welcome.
- Reading about MapReduce, it sounds like a general framework able to
split a "big job" into many smaller "sub-jobs" and have those "sub-jobs"
executed concurrently (potentially on different machines), all in all
completing the "big job". This could be used for many things other than
"working with files", but then again the examples and some of the
descriptions make it sound like it is all about "jobs working with
files". Is MapReduce only useful for/concerned with "jobs" related to
"working with files", or is it general-purpose enough to be useful for
any split-a-big-job-into-many-smaller-jobs-and-execute-them-in-parallel
problem?
- I believe we will end up having an HDFS over the disks on the
machines/shards in the secure zone. Is HDFS a "must have" for MapReduce
to work at all? E.g. HDFS might be the way sub-jobs are distributed
and/or persisted (so that they are not forgotten in case of a shard
breakdown or something).
- I think it sounds like a lot of overhead to copy a big file (it will
have to be deleted after successful "loading") from the FTP server disk
in the DMZ to the HDFS in the secure zone, just to be able to use
MapReduce to distribute the work of "loading" it. We might want to do it
in such a way that each "sub-job" (of a "big job" about loading e.g. a
big file big.txt) just points to big.txt together with from- and
to-indexes into the file. Each "sub-job" would then only read the part
of big.txt from from-index to to-index and "load" that. Will we be able
to do something like that using MapReduce, or is it all "based on
operating on files on the HDFS"?
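
Just to be concrete about what I mean by a "sub-job" description, it
would be something like the following. This is an illustration only; all
names are made up:

// Illustration only: what we imagine a "sub-job" description for
// loading part of big.txt could look like. All names are made up.
public class LoadSubJob {

    private final String filePath;  // e.g. the path to big.txt in the DMZ
    private final long fromIndex;   // first record/byte this sub-job handles
    private final long toIndex;     // last record/byte (exclusive)

    public LoadSubJob(String filePath, long fromIndex, long toIndex) {
        this.filePath = filePath;
        this.fromIndex = fromIndex;
        this.toIndex = toIndex;
    }

    public String getFilePath() { return filePath; }
    public long getFromIndex()  { return fromIndex; }
    public long getToIndex()    { return toIndex; }
}

From what I can tell from the API documentation, Hadoop's own FileSplit
seems to carry more or less the same information (a path, a start offset
and a length), which is why I am hopeful that something like this is
possible.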
- Depending on the answer to the above question, we might want to be
able to make the disk on the FTP server "join" the HDFS, in such a way
that it is visible, but without the data on it getting replicated in
several copies (for redundancy purposes) throughout the disks on the
shards (the "real" part of the HDFS) - remember the file will have to be
deleted as soon as it has been "loaded". Is there such a
concept/possibility of making an "external" disk visible from HDFS, to
enable MapReduce to work on files on such a disk, without the files on
it automatically being replicated to several other disks (on the
shards)?
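
One thing I wonder about, related to this: could we simply point a job's
input at a non-HDFS path through Hadoop's generic FileSystem/Path
abstraction, something like the sketch below? I do not know whether this
actually works across a cluster - presumably the FTP disk would have to
be mounted (e.g. via NFS) on every node that runs tasks. The mount point
below is made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

// Sketch only - we have not tried this. The mount point is made up;
// the FTP server's disk would have to be visible (e.g. NFS-mounted)
// on every node that runs tasks for this to have any chance.
public class NonHdfsInputSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(NonHdfsInputSketch.class);
        FileInputFormat.addInputPath(conf,
            new Path("file:///mnt/ftp-dmz/incoming/big.txt"));
        // ... set mapper/reducer etc. and submit the job ...
    }
}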
- As I understand it, each "sub-job" (the result of the split operation)
will be run on a new dedicated JVM. It sounds like a big overhead to
start a new JVM just to run a "small" job. Is it correct that each
"sub-job" will run in its own new JVM, started for that purpose only? If
yes, it seems to me like the overhead is only "worth it" for fairly
large "sub-jobs". Do you agree?
If yes, I find the "WordCount" example at
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html kind
of stupid, because it seems like each "sub-job" is only about handling
one single line, and that seems to me to be way too small a "sub-job" to
make it "worth the effort" to move it to a remote machine and start a
new JVM to handle it. Do you agree that it is stupid (yes, I know it is
just an example), or what did I miss?
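
For reference, the map function from that example looks roughly like
this (reproduced from the tutorial, possibly not character-for-character);
as far as I can tell it is invoked once per input line, which is what
makes it look to me like one tiny "sub-job" per line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once for every line of the input - which is what makes
    // it look (to me) like one tiny "sub-job" per line.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}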
- Finally, with respect to side effects. When handling the files we plan
to load the records in the files into some kind of database (maybe
several instances of a database). It is important that each record only
gets inserted into one database, once. As I understand it, MapReduce
will make every "sub-job" run in several instances concurrently on
several different machines, in order to make sure that it finishes
quickly even if one of the attempts to handle that particular "sub-job"
fails. Is that true?
If yes, isn't that a big problem with respect to "sub-jobs" with side
effects (like inserting into a database)? Or is there some kind of
built-in assumption that all side effects happen on HDFS, and does HDFS
support some kind of transaction handling, so that it is easy for
MapReduce to roll back the side effects of one of the "identical"
sub-jobs if two should both succeed? (A sketch of what our "loading"
essentially does follows below.)
In general, is it a built-in thing that each sub-job runs in one single
transaction, so that it is not possible for a sub-job to "partly"
succeed and "partly" fail? (E.g. if it has to load 10000 records into a
database and succeeds with 9999 of those, it might be stupid to roll it
all back in order to try it all over again.)
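
To make the side-effect concern concrete: our "loading" of a single
record is essentially the JDBC sketch below. The table name and columns
are made up. If MapReduce really runs the same sub-job twice
(speculatively), both runs would execute these inserts and every record
in that sub-job's range would end up in the database twice:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Illustration of the side effect I am worried about. Table name and
// columns are made up. If the same sub-job runs twice, both runs
// would execute this insert for every record in the sub-job's range.
public class RecordLoader {
    public void load(Connection connection, String recordKey, String payload)
            throws SQLException {
        PreparedStatement stmt = connection.prepareStatement(
                "INSERT INTO records (record_key, payload) VALUES (?, ?)");
        try {
            stmt.setString(1, recordKey);
            stmt.setString(2, payload);
            stmt.executeUpdate();
        } finally {
            stmt.close();
        }
    }
}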
I know my English is not perfect, but I hope you at least get the
essence of my questions. I hope you will try to answer all the
questions, even though some of them might seem stupid to you. Remember
that I am a newbie :-) I have been running through the FAQ, but didn't
find any answers to my questions (maybe because they are stupid :-) ). I
wasn't able to search the archives of the mailing list, so I quickly
gave up finding my answers in "old threads". Can someone point me to a
way of searching the archives?
Regards, Per Steffensen