Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

Michael Segel Thu, 19 Apr 2012 20:50:14 -0700

How 'large' or rather in this case small is your file? 

If you're on a default system, the block sizes are 64MB. So if your file ~<= 
64MB, you end up with 1 block, and you will only have 1 mapper.



On Apr 19, 2012, at 10:10 PM, Sky wrote:

> Thanks for your reply.  After I sent my email, I found a fundamental defect - 
> in my understanding of how MR is distributed. I discovered that even though I 
> was firing off 15 COREs, the map job - which is the most expensive part of my 
> processing was run only on 1 core.
> 
> To start my map job, I was creating a single file with following data:
>  1 storage:/root/1.manif.txt
>  2 storage:/root/2.manif.txt
>  3 storage:/root/3.manif.txt
>  ...
>  4000 storage:/root/4000.manif.txt
> 
> I thought that each of the available COREs will be assigned a map job from 
> top down from the same file one at a time, and as soon as one CORE is done, 
> it would get the next map job. However, it looks like I need to split the 
> file into the number of times. Now while that’s clearly trivial to do, I am 
> not sure how I can detect at runtime how many splits I need to do, and also 
> to deal with adding new CORES at runtime. Any suggestions?  (it doesn't have 
> to be a file, it can be a list, etc).
> 
> This all would be much easier to debug, if somehow I could get my log4j logs 
> for my mappers and reducers. I can see log4j for my main launcher, but not 
> sure how to enable it for mappers and reducers.
> 
> Thx
> - Akash
> 
> 
> -----Original Message----- From: Robert Evans
> Sent: Thursday, April 19, 2012 2:08 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Help me with architecture of a somewhat non-trivial mapreduce 
> implementation
> 
> From what I can see your implementation seems OK, especially from a 
> performance perspective. Depending on what storage: is it is likely to be 
> your bottlekneck, not the hadoop computations.
> 
> Because you are writing files directly instead of relying on Hadoop to do it 
> for you, you may need to deal with error cases that Hadoop will normally hide 
> from you, and you will not be able to turn on speculative execution. Just be 
> aware that a map or reduce task may have problems in the middle, and be 
> relaunched.  So when you are writing out your updated manifest be careful to 
> not replace the old one until the new one is completely ready and will not 
> fail, or you may lose data.  You may also need to be careful in your reduce 
> if you are writing directly to the file there too, but because it is not a 
> read modify write, but just a write it is not as critical.
> 
> --Bobby Evans
> 
> On 4/18/12 4:56 PM, "Sky USC" <sky...@hotmail.com> wrote:
> 
> 
> 
> 
> Please help me architect the design of my first significant MR task beyond 
> "word count". My program works well. but I am trying to optimize performance 
> to maximize use of available computing resources. I have 3 questions at the 
> bottom.
> 
> Project description in an abstract sense (written in java):
> * I have MM number of MANIFEST files available on storage:/root/1.manif.txt 
> to 4000.manif.txt
>    * Each MANIFEST in turn contains varilable number "EE" of URLs to EBOOKS 
> (range could be 10000 - 50,000 EBOOKS urls per MANIFEST) -- stored on 
> storage:/root/1.manif/1223.folder/5443.Ebook.ebk
> So we are talking about millions of ebooks
> 
> My task is to:
> 1. Fetch each ebook, and obtain a set of 3 attributes per ebook (example: 
> publisher, year, ebook-version).
> 2. Update each of the EBOOK entry record in the manifest - with the 3 
> attributes (eg: ebook 1334 -> publisher=aaa year=bbb, ebook-version=2.01)
> 3. Create a output file such that the named 
> "<publisher>_<year>_<ebook-version>"  contains a list of all "ebook urls" 
> that met that criteria.
> example:
> File "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" contains:
> storage:/root/1.manif/1223.folder/2143.Ebook.ebk
> storage:/root/2.manif/2133.folder/5449.Ebook.ebk
> storage:/root/2.manif/2133.folder/5450.Ebook.ebk
> etc..
> 
> and File "storage:/root/summary/PENGUIN_2001_3.12.txt" contains:
> storage:/root/19.manif/2223.folder/4343.Ebook.ebk
> storage:/root/13.manif/9733.folder/2149.Ebook.ebk
> storage:/root/21.manif/3233.folder/1110.Ebook.ebk
> 
> etc
> 
> 4. finally, I also want to output statistics such that:
> <publisher>_<year>_<ebook-version>  <COUNT_OF_URLs>
> PENGUIN_2001_3.12     250,111
> RANDOMHOUSE_1999_2.01  11,322
> etc
> 
> Here is how I implemented:
> * My launcher gets list of MM manifests
> * My Mapper gets one manifest.
> --- It reads the manifest, within a WHILE loop,
>   --- fetches each EBOOK,  and obtain attributes from each ebook,
>   --- updates the manifest for that ebook
>   --- context.write(new Text("RANDOMHOUSE_1999_2.01"), new 
> Text("storage:/root/1.manif/1223.folder/2143.Ebook.ebk"))
> --- Once all ebooks in the manifest are read, it saves the updated Manifest, 
> and exits
> * My Reducer gets the "RANDOMHOUSE_1999_2.01" and a list of ebooks urls.
> --- It writes a new file "storage:/root/summary/RANDOMHOUSE_1999_2.01.txt" 
> with all the storage urls for the ebooks
> --- It also does a context.write(new Text("RANDOMHOUSE_1999_2.01"), new 
> IntWritable(SUM_OF_ALL_EBOOK_URLS_FROM_THE_LIST))
> 
> As I mentioned, its working. I launch it on 15 elastic instances. I have 
> three questions:
> 1. Is this the best way to implement the MR logic?
> 2. I dont know if each of the instances is getting one task or multiple tasks 
> simultaneously for the MAP portion. If it is not getting multiple MAP tasks, 
> should I go with the route of "multithreaded" reading of ebooks from each 
> manifest? Its not efficient to read just one ebook at a time per machine. Is 
> "Context.write()" threadsafe?
> 3. I can see log4j logs for main program, but no visibility into logs for 
> Mapper or Reducer. Any idea?
> 
> 
> 
> 
>

Re: Help me with architecture of a somewhat non-trivial mapreduce implementation

Reply via email to