>From what I can see your implementation seems OK, especially from a 
>performance perspective. Depending on what storage: is, it is likely to be your 
>bottleneck, not the Hadoop computations.

Because you are writing files directly instead of relying on Hadoop to do it 
for you, you may need to deal with error cases that Hadoop would normally hide 
from you, and you will not be able to turn on speculative execution.  Just be 
aware that a map or reduce task may fail partway through and be relaunched.  
So when you write out your updated manifest, be careful not to replace the old 
one until the new one is completely written and can no longer fail, or you may 
lose data.  You may also need to be careful in your reduce if you are writing 
directly to the file there too, but because that is a plain write rather than 
a read-modify-write, it is not as critical.
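
One way to follow that advice is the write-then-rename pattern. The sketch 
below is only an illustration using java.nio on a local filesystem, not Hadoop 
code: ManifestPublisher, publish, and the attemptId parameter are hypothetical 
names. On HDFS the analogous step would be a rename through the Hadoop 
FileSystem API, and whether a rename is atomic on your particular storage: is 
something you would need to verify.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Hypothetical helper: publishes an updated manifest without exposing a partial file. */
final class ManifestPublisher {

    /**
     * Writes the updated lines to a temporary sibling file, then renames it
     * over the old manifest. The old manifest stays untouched until the new
     * one has been written completely.
     */
    static void publish(Path manifest, List<String> updatedLines, String attemptId)
            throws IOException {
        // Assumption: attemptId is unique per task attempt, so a relaunched
        // task never writes into the temp file of an earlier attempt.
        Path tmp = manifest.resolveSibling(
                manifest.getFileName() + "." + attemptId + ".tmp");
        try {
            Files.write(tmp, updatedLines, StandardCharsets.UTF_8);
            // Swap in the new manifest only after the write succeeded.
            // Whether an existing target is replaced by ATOMIC_MOVE is
            // implementation specific; on POSIX filesystems it is.
            Files.move(tmp, manifest, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // If the write or the move failed, do not leave the temp file behind.
            Files.deleteIfExists(tmp);
        }
    }
}
```

If the task dies before the move, the old manifest is still intact and the 
relaunched attempt simply starts over.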

--Bobby Evans

On 4/18/12 4:56 PM, "Sky USC" <sky...@hotmail.com> wrote:




Please help me architect the design of my first significant MR task beyond 
"word count". My program works well, but I am trying to optimize performance to 
maximize use of available computing resources. I have 3 questions at the bottom.

Project description in an abstract sense (written in Java):
* I have MM number of MANIFEST files available on storage:/root/1.manif.txt to 
4000.manif.txt
     * Each MANIFEST in turn contains a variable number "EE" of URLs to EBOOKS 
(range could be 10,000 - 50,000 EBOOK urls per MANIFEST) -- stored on 
storage:/root/1.manif/1223.folder/5443.Ebook.ebk
So we are talking about millions of ebooks

My task is to:
1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: 
publisher, year, ebook-version).
2. Update each EBOOK entry record in the manifest with the 3 
attributes (eg: ebook 1334 -> publisher=aaa year=bbb, ebook-version=2.01)
3. Create an output file named "<publisher>_<year>_<ebook-version>" that 
contains a list of all "ebook urls" that match those attributes.
example:
File "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" contains:
storage:/root/1.manif/1223.folder/2143.Ebook.ebk
storage:/root/2.manif/2133.folder/5449.Ebook.ebk
storage:/root/2.manif/2133.folder/5450.Ebook.ebk
etc..

and File "storage:/root/summary/PENGUIN_2001_3.12.txt" contains:
storage:/root/19.manif/2223.folder/4343.Ebook.ebk
storage:/root/13.manif/9733.folder/2149.Ebook.ebk
storage:/root/21.manif/3233.folder/1110.Ebook.ebk

etc

4. finally, I also want to output statistics such that:
<publisher>_<year>_<ebook-version>  <COUNT_OF_URLs>
PENGUIN_2001_3.12     250,111
RANDOMHOUSE_1999_2.01  11,322
etc
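
To make items 3 and 4 concrete, here is a small plain-Java sketch of the 
grouping and counting, with no Hadoop involved. EbookSummary and its methods 
are hypothetical names, and the (key, url) pairs are passed in as two-element 
String arrays purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of the summary files (item 3) and statistics (item 4). */
final class EbookSummary {

    /** Builds the "<publisher>_<year>_<ebook-version>" key. */
    static String key(String publisher, String year, String version) {
        return publisher + "_" + year + "_" + version;
    }

    /** Groups (key, url) pairs by key: one entry per summary file. */
    static Map<String, List<String>> group(List<String[]> pairs) {
        Map<String, List<String>> byKey = new TreeMap<>();
        for (String[] pair : pairs) {
            // pair[0] is the key, pair[1] is the ebook url.
            byKey.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return byKey;
    }

    /** Counts the urls per key: one line of the statistics output. */
    static Map<String, Integer> counts(Map<String, List<String>> byKey) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : byKey.entrySet()) {
            result.put(entry.getKey(), entry.getValue().size());
        }
        return result;
    }
}
```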

Here is how I implemented it:
* My launcher gets the list of MM manifests
* My Mapper gets one manifest.
 --- It reads the manifest, within a WHILE loop,
    --- fetches each EBOOK, and obtains the attributes from each ebook,
    --- updates the manifest for that ebook
    --- context.write(new Text("RANDOMHOUSE_1999_2.01"), new 
Text("storage:/root/1.manif/1223.folder/2143.Ebook.ebk"))
 --- Once all ebooks in the manifest are read, it saves the updated Manifest, 
and exits
* My Reducer gets the "RANDOMHOUSE_1999_2.01" and a list of ebooks urls.
 --- It writes a new file "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" 
with all the storage urls for the ebooks
 --- It also does a context.write(new Text("RANDOMHOUSE_1999_2.01"), new 
IntWritable(SUM_OF_ALL_EBOOK_URLS_FROM_THE_LIST))

As I mentioned, it's working. I launch it on 15 elastic instances. I have three 
questions:
1. Is this the best way to implement the MR logic?
2. I don't know if each of the instances is getting one task or multiple tasks 
simultaneously for the MAP portion. If it is not getting multiple MAP tasks, 
should I go the route of "multithreaded" reading of ebooks from each 
manifest? It's not efficient to read just one ebook at a time per machine. Is 
"Context.write()" threadsafe?
3. I can see log4j logs for the main program, but have no visibility into logs 
for the Mapper or Reducer. Any idea?
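
For question 2, one way to sidestep the thread-safety question is to fetch in 
worker threads but emit from a single thread. The sketch below is plain Java 
under assumptions: ParallelFetch is a hypothetical name, fetchKey stands in 
for "fetch the ebook and build its key", and emit stands in for the call to 
context.write(). Hadoop also ships a MultithreadedMapper class that may be 
worth a look.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Hypothetical sketch: parallel fetches, single-threaded emission. */
final class ParallelFetch {

    static void process(List<String> urls, int threads,
                        Function<String, String> fetchKey,
                        BiConsumer<String, String> emit)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one fetch per url; the slow I/O runs in the pool threads.
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetchKey.apply(url);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order and emit from the calling
            // thread only, so emit never needs to be thread-safe.
            for (int i = 0; i < urls.size(); i++) {
                String key = pending.get(i).get();
                emit.accept(key, urls.get(i));
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Because only the calling thread ever touches emit, the design does not depend 
on how Context.write() behaves under concurrent calls.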



