Help me with architecture of a somewhat non-trivial mapreduce implementation

Sky USC Wed, 18 Apr 2012 14:56:40 -0700


Please help me with the architecture of my first significant MR job beyond 
"word count". My program works, but I am trying to optimize its performance 
and make full use of the available computing resources. I have three questions 
at the bottom.


Project description in an abstract sense (written in Java):
* I have "MM" MANIFEST files available on storage: storage:/root/1.manif.txt 
through storage:/root/4000.manif.txt
     * Each MANIFEST in turn contains a variable number "EE" of EBOOK URLs 
(anywhere from 10,000 to 50,000 per MANIFEST). Each EBOOK is stored at a path 
like storage:/root/1.manif/1223.folder/5443.Ebook.ebk
So we are talking about millions of ebooks.

My task is to:
1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: 
publisher, year, ebook-version). 
2. Update each EBOOK record in the manifest with the 3 attributes 
(e.g. ebook 1334 -> publisher=aaa, year=bbb, ebook-version=2.01).
3. Create one output file per attribute combination, named 
"<publisher>_<year>_<ebook-version>", that contains a list of all ebook URLs 
that match that combination.
Example: 
File "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" contains:
storage:/root/1.manif/1223.folder/2143.Ebook.ebk
storage:/root/2.manif/2133.folder/5449.Ebook.ebk
storage:/root/2.manif/2133.folder/5450.Ebook.ebk
etc.

and File "storage:/root/summary/PENGUIN_2001_3.12.txt" contains:
storage:/root/19.manif/2223.folder/4343.Ebook.ebk
storage:/root/13.manif/9733.folder/2149.Ebook.ebk
storage:/root/21.manif/3233.folder/1110.Ebook.ebk

etc

4. Finally, I also want to output statistics of the form:
<publisher>_<year>_<ebook-version>  <COUNT_OF_URLs>
PENGUIN_2001_3.12     250,111
RANDOMHOUSE_1999_2.01  11,322
etc
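
To make the key, file name and statistics formats from steps 3 and 4 concrete, 
here is a minimal sketch in plain Java. The class and method names (SummaryKey, 
of, summaryFile, statsLine) are made up for this example and are not taken 
from my actual job. The two spaces between key and count are also just an 
assumption about the layout.

```java
import java.util.Locale;

// Hypothetical helper (names invented for illustration only).
class SummaryKey {
    // Joins the three attributes as <publisher>_<year>_<ebook-version>.
    static String of(String publisher, int year, String ebookVersion) {
        return publisher + "_" + year + "_" + ebookVersion;
    }

    // Path of the per-key output file under storage:/root/summary/.
    static String summaryFile(String key) {
        return "storage:/root/summary/" + key + ".txt";
    }

    // One statistics line: key, two spaces, count with thousands separators.
    // Locale.US is fixed so the separator is always a comma.
    static String statsLine(String key, long count) {
        return String.format(Locale.US, "%s  %,d", key, count);
    }
}
```

For example, SummaryKey.of("PENGUIN", 2001, "3.12") gives the key used in the 
statistics above, and summaryFile(...) on that key gives the matching file 
under storage:/root/summary/.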

Here is how I implemented it:
* My launcher gets the list of MM manifests.
* My Mapper gets one manifest.
 --- It reads the manifest and, within a WHILE loop,
    --- fetches each EBOOK and obtains the attributes from it,
    --- updates the manifest entry for that ebook,
    --- calls context.write(new Text("RANDOMHOUSE_1999_2.01"), new 
Text("storage:/root/1.manif/1223.folder/2143.Ebook.ebk"))
 --- Once all ebooks in the manifest have been read, it saves the updated 
manifest and exits.
* My Reducer gets the key "RANDOMHOUSE_1999_2.01" and a list of ebook URLs.
 --- It writes a new file "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" 
containing all the storage URLs for those ebooks.
 --- It also does a context.write(new Text("RANDOMHOUSE_1999_2.01"), new 
IntWritable(SUM_OF_ALL_EBOOK_URLS_FROM_THE_LIST))
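
The data flow described above can be modelled in plain Java, without Hadoop, 
roughly as follows. This is only an in-memory sketch of the shape of the job, 
not my real Mapper and Reducer: the names MrFlowSketch, map, shuffle and 
reduce are invented here, and the keyOf function stands in for "fetch the 
ebook and read its three attributes".

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// In-memory model of the job's data flow. This is NOT Hadoop code.
class MrFlowSketch {
    // "Map": one call per manifest. For each ebook URL, look up its key
    // (<publisher>_<year>_<ebook-version>) and emit a (key, url) pair.
    static List<Map.Entry<String, String>> map(List<String> manifestUrls,
                                               Function<String, String> keyOf) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String url : manifestUrls) {
            out.add(new AbstractMap.SimpleEntry<>(keyOf.apply(url), url));
        }
        return out;
    }

    // "Shuffle": group the emitted pairs by key, which the framework does
    // between the map and reduce phases.
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // "Reduce": one call per key. The URL list is what would be written to the
    // summary file; the returned size is the count emitted for the statistics.
    static int reduce(String key, List<String> urls) {
        return urls.size();
    }
}
```

The point of the model is that the per-key file and the per-key count both 
come from the same grouped URL list, so a single reduce call per key can 
produce both outputs.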

As I mentioned, it's working. I launch it on 15 elastic instances. I have three 
questions:
1. Is this the best way to implement the MR logic?
2. I don't know whether each instance is getting one MAP task or multiple MAP 
tasks simultaneously. If it is not getting multiple MAP tasks, should I go the 
route of "multithreaded" reading of ebooks from each manifest? It's not 
efficient to read just one ebook at a time per machine. Is "Context.write()" 
thread-safe?
3. I can see the log4j logs for the main program, but I have no visibility into 
the logs for the Mapper or Reducer. Any idea where those go?
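
To show what I mean by "multithreaded" reading in question 2, here is a 
minimal plain-Java sketch. It is not my actual Mapper: ParallelFetchSketch, 
processManifest, fetchKey and emit are hypothetical names, with fetchKey 
standing in for "fetch the ebook and build its key" and emit standing in for 
context.write. The sketch fetches on a thread pool but calls emit only from 
the calling thread, so it does not depend on whether the write call is 
thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical sketch: parallel fetch, single-threaded emit.
class ParallelFetchSketch {
    static void processManifest(List<String> urls, int threads,
                                Function<String, String> fetchKey,
                                BiConsumer<String, String> emit) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one fetch task per ebook URL; these run concurrently.
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetchKey.apply(url);
                pending.add(pool.submit(task));
            }
            // Collect results in manifest order and emit from this thread only.
            for (int i = 0; i < urls.size(); i++) {
                emit.accept(pending.get(i).get(), urls.get(i));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread count would need tuning per instance; I have not measured what 
value makes sense for my workload.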
