From what I can see your implementation seems OK, especially from a performance perspective. Depending on what storage: is, it is likely to be your bottleneck, not the Hadoop computations.

Because you are writing files directly instead of relying on Hadoop to do it for you, you may need to deal with error cases that Hadoop will normally hide from you, and you will not be able to turn on speculative execution. Just be aware that a map or reduce task may have problems in the middle and be relaunched. So when you are writing out your updated manifest, be careful not to replace the old one until the new one is completely ready and will not fail, or you may lose data. You may also need to be careful in your reduce if you are writing directly to the file there too, but because that is a plain write rather than a read-modify-write, it is not as critical.

--Bobby Evans

On 4/18/12 4:56 PM, "Sky USC" <sky...@hotmail.com> wrote:

Please help me architect the design of my first significant MR task beyond "word count". My program works well, but I am trying to optimize performance to maximize use of available computing resources. I have 3 questions at the bottom.

Project description in an abstract sense (written in Java):

* I have MM number of MANIFEST files available on storage:/root/1.manif.txt to 4000.manif.txt
* Each MANIFEST in turn contains a variable number "EE" of URLs to EBOOKS (range could be 10,000 - 50,000 EBOOK URLs per MANIFEST), stored on storage:/root/1.manif/1223.folder/5443.Ebook.ebk

So we are talking about millions of ebooks.

My task is to:

1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: publisher, year, ebook-version).

2. Update each EBOOK entry record in the manifest with the 3 attributes (eg: ebook 1334 -> publisher=aaa year=bbb, ebook-version=2.01)

3. Create an output file, named "<publisher>_<year>_<ebook-version>", that contains a list of all ebook URLs that met that criteria. Example:

File "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" contains:
storage:/root/1.manif/1223.folder/2143.Ebook.ebk
storage:/root/2.manif/2133.folder/5449.Ebook.ebk
storage:/root/2.manif/2133.folder/5450.Ebook.ebk
etc..

and File "storage:/root/summary/PENGUIN_2001_3.12.txt" contains:
storage:/root/19.manif/2223.folder/4343.Ebook.ebk
storage:/root/13.manif/9733.folder/2149.Ebook.ebk
storage:/root/21.manif/3233.folder/1110.Ebook.ebk
etc

4. Finally, I also want to output statistics such that:

<publisher>_<year>_<ebook-version> <COUNT_OF_URLs>
PENGUIN_2001_3.12 250,111
RANDOMHOUSE_1999_2.01 11,322
etc

Here is how I implemented it:

* My launcher gets the list of MM manifests
* My Mapper gets one manifest.
--- It reads the manifest, and within a WHILE loop,
--- fetches each EBOOK, and obtains attributes from each ebook,
--- updates the manifest for that ebook
--- context.write(new Text("RANDOMHOUSE_1999_2.01"), new Text("storage:/root/1.manif/1223.folder/2143.Ebook.ebk"))
--- Once all ebooks in the manifest are read, it saves the updated Manifest, and exits
* My Reducer gets the "RANDOMHOUSE_1999_2.01" and a list of ebook URLs.
--- It writes a new file "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" with all the storage URLs for the ebooks
--- It also does a context.write(new Text("RANDOMHOUSE_1999_2.01"), new IntWritable(SUM_OF_ALL_EBOOK_URLS_FROM_THE_LIST))

As I mentioned, it's working. I launch it on 15 elastic instances. I have three questions:

1. Is this the best way to implement the MR logic?

2. I don't know if each of the instances is getting one task or multiple tasks simultaneously for the MAP portion. If it is not getting multiple MAP tasks, should I go the route of "multithreaded" reading of ebooks from each manifest? It's not efficient to read just one ebook at a time per machine. Is "Context.write()" threadsafe?

3. I can see log4j logs for the main program, but have no visibility into logs for the Mapper or Reducer. Any idea?
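
Bobby's point about not replacing the old manifest until the new one is completely ready can be sketched as a write-to-temp-then-rename step. This is only a minimal sketch, assuming storage: behaves like a local filesystem with an atomic rename (some stores do not offer one, in which case a different scheme is needed). ManifestWriter, replaceManifest, and the attemptId suffix are hypothetical names for illustration, not part of the original program.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper: not from the original program.
final class ManifestWriter {
    private ManifestWriter() {}

    /**
     * Writes the updated manifest to a temp file beside the old one, then
     * swaps it in with a single rename. If the task dies before the rename,
     * the old manifest is still intact.
     * Assumption: the filesystem supports an atomic move within a directory.
     */
    static void replaceManifest(Path manifest, List<String> updatedLines, String attemptId)
            throws IOException {
        // A per-attempt temp name keeps a relaunched task from colliding
        // with leftovers of a failed attempt.
        Path tmp = manifest.resolveSibling(manifest.getFileName() + "." + attemptId + ".tmp");
        try {
            Files.write(tmp, updatedLines, StandardCharsets.UTF_8);
            // With ATOMIC_MOVE the other options are ignored; whether an
            // existing target is replaced is implementation specific
            // (it is replaced on common Linux filesystems).
            Files.move(tmp, manifest,
                    StandardCopyOption.ATOMIC_MOVE,
                    StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up after a failure.
            Files.deleteIfExists(tmp);
        }
    }
}
```

In a mapper, attemptId could be something like context.getTaskAttemptID().toString(), so that each relaunched attempt writes to its own temp file.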