Hi
We are considering using MapReduce for a project. I am participating in
an "investigation" phase where we are trying to find out whether we
would benefit from using the MapReduce framework.
A little bit about the project:
We will be receiving data from the "outside world" in files via FTP. It
will be a mix of very small files (50 records/lines) and very big files
(5 million+ records/lines). The FTP server will be running in a DMZ where
we have no plans of using any Hadoop technology. For every file arriving
over FTP we will add a message (just pointing to that file) to an MQ also
running in the DMZ - how we do that is not relevant to my questions here.
In the secure zone of our system we plan to run many machines ("shards"
if you like) that will, among other things, be consumers on the MQ in
the DMZ. Their job will, among other things, be to "load" (store in a
database, index, etc.) the files pointed to by the messages they receive
from the MQ. For reasonably small files a consumer will probably just do
the "loading" of the entire file itself. For very big files we would
like to have more machines/shards participating in "loading" that
particular file than just the single machine/shard that happens to
receive the corresponding message.
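
To make that concrete, here is a rough sketch of what we imagine each
consumer doing. Everything in it (class name, method names, the
50000-record threshold) is made up purely to illustrate the intent:

// Rough sketch of one MQ consumer in the secure zone. All names and
// the threshold are hypothetical - just illustrating the intent.
public class LoadConsumer {

    private static final long SMALL_FILE_THRESHOLD = 50000; // records

    // Called for every message taken off the MQ; a message just points
    // at a file on the FTP server in the DMZ.
    public void onMessage(String filePath, long recordCount) {
        if (recordCount <= SMALL_FILE_THRESHOLD) {
            // Reasonably small file: this consumer "loads" it all itself.
            loadEntireFile(filePath);
        } else {
            // Very big file: we would like more machines/shards than just
            // this one to participate in "loading" it - this is where we
            // hope MapReduce comes in.
            distributeLoading(filePath);
        }
    }

    private void loadEntireFile(String path) { /* store in db, index, etc. */ }

    private void distributeLoading(String path) { /* the open question */ }
}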
Questions:
- In general, do you think MapReduce will be beneficial for us to use?
Please remember that the files to be "loaded" do not live on HDFS. Any
description of why you would suggest that we use MapReduce will be very
welcome.
- Reading about MapReduce, it sounds like a general framework able to
split a "big job" into many smaller "sub-jobs" and have those "sub-jobs"
executed concurrently (potentially on different machines), all in all
completing the "big job". This could be used for many things other than
"working with files", but then again the examples and some of the
descriptions make it sound like it is all about "jobs working with
files". Is MapReduce only useful for/concerned with "jobs" related to
"working with files", or is it general-purpose enough to be useful for
any split-a-big-job-into-many-smaller-jobs-and-execute-them-in-parallel
problem?
- I believe we will end up having an HDFS over the disks on the
machines/shards in the secure zone. Is HDFS a "must have" for MapReduce
to work at all? E.g. HDFS might be the way sub-jobs are distributed
and/or persisted (so that they are not forgotten in case of a shard
breakdown or something).
- I think it sounds like a lot of overhead to copy a big file (it will
have to be deleted after successful "loading") from the FTP server disk
in the DMZ to the HDFS in the secure zone, just to be able to use
MapReduce to distribute the work of "loading" it. We might want to do it
in such a way that each "sub-job" (of a "big job" about loading e.g. a
big file big.txt) just points to big.txt together with from- and
to-indexes into the file. Each "sub-job" would then only read the part
of big.txt from from-index to to-index and "load" that. Will we be able
to do something like that using MapReduce, or is it all "based on
operating on files on the HDFS"?
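
Just to be concrete about what I mean by a "sub-job" description, it
would be something like the following. This is an illustration only; all
names are made up:

// Illustration only: what we imagine a "sub-job" description for
// loading part of big.txt could look like. All names are made up.
public class LoadSubJob {

    private final String filePath;  // e.g. the path to big.txt in the DMZ
    private final long fromIndex;   // first record/byte this sub-job handles
    private final long toIndex;     // last record/byte (exclusive)

    public LoadSubJob(String filePath, long fromIndex, long toIndex) {
        this.filePath = filePath;
        this.fromIndex = fromIndex;
        this.toIndex = toIndex;
    }

    public String getFilePath() { return filePath; }
    public long getFromIndex()  { return fromIndex; }
    public long getToIndex()    { return toIndex; }
}

From what I can tell from the API documentation, Hadoop's own FileSplit
seems to carry more or less the same information (a path, a start offset
and a length), which is why I am hopeful that something like this is
possible.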
- Depending on the answer to the above question, we might want to be
able to make the disk on the FTP server "join" the HDFS, in such a way
that it is visible, but without the data on it getting replicated in
several copies (for redundancy purposes) throughout the disks on the
shards (the "real" part of the HDFS) - remember the file will have to be
deleted as soon as it has been "loaded". Is there such a
concept/possibility of making an "external" disk visible from HDFS, to
enable MapReduce to work on files on such a disk, without the files on
it automatically being replicated to several other disks (on the
shards)?
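
One thing I wonder about, related to this: could we simply point a job's
input at a non-HDFS path through Hadoop's generic FileSystem/Path
abstraction, something like the sketch below? I do not know whether this
actually works across a cluster - presumably the FTP disk would have to
be mounted (e.g. via NFS) on every node that runs tasks. The mount point
below is made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

// Sketch only - we have not tried this. The mount point is made up;
// the FTP server's disk would have to be visible (e.g. NFS-mounted)
// on every node that runs tasks for this to have any chance.
public class NonHdfsInputSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(NonHdfsInputSketch.class);
        FileInputFormat.addInputPath(conf,
            new Path("file:///mnt/ftp-dmz/incoming/big.txt"));
        // ... set mapper/reducer etc. and submit the job ...
    }
}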
- As I understand it, each "sub-job" (the result of the split operation)
will be run on a new dedicated JVM. It sounds like a big overhead to
start a new JVM just to run a "small" job. Is it correct that each
"sub-job" will run in its own new JVM, started for that purpose only? If
yes, it seems to me like the overhead is only "worth it" for fairly
large "sub-jobs". Do you agree?
If yes, I find the "WordCount" example at
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html kind
of stupid, because it seems like each "sub-job" is only about handling
one single line, and that seems to me to be way too small a "sub-job" to
make it "worth the effort" to move it to a remote machine and start a
new JVM to handle it. Do you agree that it is stupid (yes, I know it is
just an example), or what did I miss?
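
For reference, the map function from that example looks roughly like
this (reproduced from the tutorial, possibly not character-for-character);
as far as I can tell it is invoked once per input line, which is what
makes it look to me like one tiny "sub-job" per line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once for every line of the input - which is what makes
    // it look (to me) like one tiny "sub-job" per line.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}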
- Finally, with respect to side effects. When handling the files we plan
to load the records in the files into some kind of database (maybe
several instances of a database). It is important that each record only
gets inserted into one database, once. As I understand it, MapReduce
will make every "sub-job" run in several instances concurrently on
several different machines, in order to make sure that it finishes
quickly even if one of the attempts to handle that particular "sub-job"
fails. Is that true?
If yes, isn't that a big problem with respect to "sub-jobs" with side
effects (like inserting into a database)? Or is there some kind of
built-in assumption that all side effects happen on HDFS, and does HDFS
support some kind of transaction handling, so that it is easy for
MapReduce to roll back the side effects of one of the "identical"
sub-jobs if two should both succeed? (A sketch of what our "loading"
essentially does follows below.)
In general, is it a built-in thing that each sub-job runs in one single
transaction, so that it is not possible for a sub-job to "partly"
succeed and "partly" fail? (E.g. if it has to load 10000 records into a
database and succeeds with 9999 of those, it might be stupid to roll it
all back in order to try it all over again.)
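
To make the side-effect concern concrete: our "loading" of a single
record is essentially the JDBC sketch below. The table name and columns
are made up. If MapReduce really runs the same sub-job twice
(speculatively), both runs would execute these inserts and every record
in that sub-job's range would end up in the database twice:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Illustration of the side effect I am worried about. Table name and
// columns are made up. If the same sub-job runs twice, both runs
// would execute this insert for every record in the sub-job's range.
public class RecordLoader {
    public void load(Connection connection, String recordKey, String payload)
            throws SQLException {
        PreparedStatement stmt = connection.prepareStatement(
                "INSERT INTO records (record_key, payload) VALUES (?, ?)");
        try {
            stmt.setString(1, recordKey);
            stmt.setString(2, payload);
            stmt.executeUpdate();
        } finally {
            stmt.close();
        }
    }
}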
I know my English is not perfect, but I hope you at least get the
essence of my questions. I hope you will try to answer all the
questions, even though some of them might seem stupid to you. Remember
that I am a newbie :-) I have been running through the FAQ, but didn't
find any answers to my questions (maybe because they are stupid :-) ). I
wasn't able to search the archives of the mailing list, so I quickly
gave up finding my answers in "old threads". Can someone point me to a
way of searching the archives?
Regards, Per Steffensen