Please help me with the design of my first significant MR task beyond "word count". My program works, but I am trying to optimize performance and make full use of the available computing resources. I have three questions at the bottom.
Project description in an abstract sense (written in Java):

* I have MM MANIFEST files available on storage:/root/1.manif.txt to 4000.manif.txt
* Each MANIFEST in turn contains a variable number "EE" of URLs to EBOOKS (the range could be 10,000 - 50,000 EBOOK URLs per MANIFEST) -- stored on storage:/root/1.manif/1223.folder/5443.Ebook.ebk

So we are talking about millions of ebooks.

My task is to:

1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: publisher, year, ebook-version).
2. Update each EBOOK entry record in the manifest with the 3 attributes (e.g. ebook 1334 -> publisher=aaa, year=bbb, ebook-version=2.01)
3. Create an output file named "<publisher>_<year>_<ebook-version>" that contains a list of all "ebook urls" that match those criteria. Example:

File "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" contains:
storage:/root/1.manif/1223.folder/2143.Ebook.ebk
storage:/root/2.manif/2133.folder/5449.Ebook.ebk
storage:/root/2.manif/2133.folder/5450.Ebook.ebk
etc.

and File "storage:/root/summary/PENGUIN_2001_3.12.txt" contains:
storage:/root/19.manif/2223.folder/4343.Ebook.ebk
storage:/root/13.manif/9733.folder/2149.Ebook.ebk
storage:/root/21.manif/3233.folder/1110.Ebook.ebk
etc.

4. Finally, I also want to output statistics such that:

<publisher>_<year>_<ebook-version> <COUNT_OF_URLs>
PENGUIN_2001_3.12 250,111
RANDOMHOUSE_1999_2.01 11,322
etc.

Here is how I implemented it:

* My launcher gets the list of MM manifests
* My Mapper gets one manifest.
--- It reads the manifest, within a WHILE loop,
--- fetches each EBOOK, and obtains attributes from each ebook,
--- updates the manifest for that ebook
--- context.write(new Text("RANDOMHOUSE_1999_2.01"), new Text("storage:/root/1.manif/1223.folder/2143.Ebook.ebk"))
--- Once all ebooks in the manifest are read, it saves the updated Manifest, and exits
* My Reducer gets the "RANDOMHOUSE_1999_2.01" key and a list of ebook URLs.
--- It writes a new file "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" with all the storage URLs for the ebooks
--- It also does a context.write(new Text("RANDOMHOUSE_1999_2.01"), new IntWritable(SUM_OF_ALL_EBOOK_URLS_FROM_THE_LIST))

As I mentioned, it's working. I launch it on 15 elastic instances.

I have three questions:

1. Is this the best way to implement the MR logic?
2. I don't know whether each instance gets one task or multiple tasks simultaneously for the MAP portion. If it is not getting multiple MAP tasks, should I go the route of "multithreaded" reading of ebooks from each manifest? It's not efficient to read just one ebook at a time per machine. Is "Context.write()" thread-safe?
3. I can see log4j logs for the main program, but I have no visibility into logs for the Mapper or Reducer. Any idea?
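
For reference, the map / group / reduce flow described above can be sketched in plain Java with no Hadoop classes. This is only an illustration of the data flow, not the actual job: `EbookFlowSketch`, `Attributes`, and the in-memory manifest map (ebook URL to already-fetched attributes) are hypothetical names, and the real fetching, manifest updating, and file writing are left out.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class EbookFlowSketch {

    /** Hypothetical holder for the three attributes obtained from one ebook. */
    static final class Attributes {
        final String publisher;
        final String year;
        final String version;

        Attributes(String publisher, String year, String version) {
            this.publisher = publisher;
            this.year = year;
            this.version = version;
        }

        /** Builds the "<publisher>_<year>_<ebook-version>" key. */
        String key() {
            return publisher + "_" + year + "_" + version;
        }
    }

    /** Map step: emits one (key, ebook URL) pair per ebook in a manifest. */
    static List<Map.Entry<String, String>> map(Map<String, Attributes> manifest) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (Map.Entry<String, Attributes> e : manifest.entrySet()) {
            pairs.add(new AbstractMap.SimpleEntry<>(e.getValue().key(), e.getKey()));
        }
        return pairs;
    }

    /** Grouping step: collects URLs per key (the framework does this between map and reduce). */
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Reduce step: the URL list per key is the summary file content; its size is the statistic. */
    static Map<String, Integer> reduce(Map<String, List<String>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // One tiny manifest, using URLs from the examples above.
        Map<String, Attributes> manifest = new LinkedHashMap<>();
        manifest.put("storage:/root/1.manif/1223.folder/2143.Ebook.ebk",
                new Attributes("RANDOMHOUSE", "1999", "2.01"));
        manifest.put("storage:/root/2.manif/2133.folder/5449.Ebook.ebk",
                new Attributes("RANDOMHOUSE", "1999", "2.01"));
        manifest.put("storage:/root/19.manif/2223.folder/4343.Ebook.ebk",
                new Attributes("PENGUIN", "2001", "3.12"));

        Map<String, List<String>> grouped = shuffle(map(manifest));
        for (Map.Entry<String, Integer> e : reduce(grouped).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

In the real job the grouping is done by the framework, so only the map and reduce steps correspond to code I wrote.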