Thanks everyone,

From this discussion, there are 2 main opinions I got:

1.      Do not call one MR job from inside another MR job.

2.      Distributed cache can be used (but it is not good for very large files).
I want to design the system so that I can do the processing efficiently. Suppose I first run an MR job to process File2 and store its output in KeyValueFormat in HDFS. Once this job is complete, I start another MR job to process File1. Now each input line of File1 will need to get some data from the output of the first MR job.

1.      The normal way is: for each input line of the 2nd MR job, loop through the contents of the output from MR job1 and get the relevant data for processing.

2.      Since I have stored the output of File2 in key-value format, can I directly get the value for a specific key?

So I want to know: if I have output1 in KeyValueFormat in HDFS, and I run a separate job with a different input file that needs to access data from output1 by key, can we achieve that without looping through output1?
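One way to get that, as a minimal sketch: if the first job writes its output with MapFileOutputFormat, each part becomes a sorted MapFile whose index supports direct lookups by key, so the second job never has to scan output1. The path, the key string, and the Text key/value types below are illustrative assumptions, not from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Minimal sketch: keyed lookup into job1's output, assuming job1 used
// MapFileOutputFormat so each part-* directory is a sorted MapFile.
public class Output1Lookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path("/user/stuti/output1/part-00000"); // hypothetical path
    MapFile.Reader reader = new MapFile.Reader(fs, part.toString(), conf);
    try {
      Text value = new Text();
      // get() binary-searches the MapFile index; no loop over output1.
      if (reader.get(new Text("someKey"), value) != null) {
        System.out.println("someKey -> " + value);
      }
    } finally {
      reader.close();
    }
  }
}

Note that with more than one reducer there is one MapFile per partition, so a lookup has to pick the partition that holds the key; the old-API MapFileOutputFormat.getReaders()/getEntry() helpers do this using the job's partitioner.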

Thanks

From: Praveen Kumar K J V S [mailto:praveenkjvs.develo...@gmail.com]
Sent: Wednesday, April 04, 2012 6:43 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Calling one MR job within another MR job

Dear Stuti,

As per the mail chain, I understand you want to do a set join on two sets File1 and File2 with some join function F(F1, F2). On this assumption, please find my reply below:

A set join is not simple, especially when the input is very large. It essentially does a cartesian product between the two sets F1 and F2 and filters out the required data based on some function F(F1, F2).

What I mean is: say you have two files, each with 10 lakh lines; then to perform a set join you essentially do 10 lakh x 10 lakh pairwise operations, and the filter phase works on all of those results to pick out the required ones.

Hence, since the cost of such a problem grows quadratically with the input size, it is helpful to know how the set-join function works; having such insight is valuable. For the common case where F reduces to an equi-join on a key, see the sketch below.
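A minimal sketch of that reduce-side equi-join: mappers tag each record with its source file, and the reducer pairs records only within a key, so the cartesian product is confined to matching keys. The tab-separated "key<TAB>rest" record format and the File1/File2 file-name prefixes are assumptions, and the driver setup is omitted:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SetJoin {

  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Tag each record with its source file so the reducer can tell them apart.
      String src = ((FileSplit) ctx.getInputSplit()).getPath().getName();
      String tag = src.startsWith("File1") ? "1" : "2";
      String[] kv = line.toString().split("\t", 2);
      ctx.write(new Text(kv[0]), new Text(tag + ":" + kv[1]));
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      List<String> f1 = new ArrayList<String>();
      List<String> f2 = new ArrayList<String>();
      for (Text v : vals) {
        String s = v.toString();
        (s.startsWith("1:") ? f1 : f2).add(s.substring(2));
      }
      // Pairs are produced only within this key, not across the whole sets.
      for (String a : f1)
        for (String b : f2)
          ctx.write(key, new Text(a + "," + b)); // apply F(a, b) here
    }
  }
}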

Though I have to admit that these kinds of problems are still under active research; please refer to the links below for more detail:


    *   http://www.youtube.com/watch?v=kiuUGXWRzPA - Google Tech Talks
    *   http://www.slideshare.net/rvernica/efficient-parallel-setsimilarity-joins-using-mapreduce-sigmod-2010-slides
    *   http://research.microsoft.com/apps/pubs/default.aspx?id=76165
    *   http://www.slideshare.net/ydn/4-similarity-joinshadoopsummit2010
@Distributed cache: it is not great if you have huge files. By default there is a size limit of 10 GB for the distributed cache (the local.cache.size property).

@Launching jobs inside a mapper: not a great idea, because you would launch a job for every key-value pair, so you would end up launching a very large number of jobs. Absolutely no. A bug in production can bring down the cluster, and it is also difficult to track all these jobs.

Thanks,
Praveen
On Wed, Apr 4, 2012 at 6:17 PM, <jagatsi...@gmail.com> wrote:
Hello Stuti

The way you have explained it, it seems we can think about caching File2 on the nodes in advance.

-- Just out of context: replicated joins are handled the same way in Pig, where the file to be joined (file2) is cached in memory while file1 is streamed.
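For illustration, a minimal sketch of that caching idea in plain MapReduce using DistributedCache. It assumes File2 fits in each task's memory and that records are tab-separated "key<TAB>rest" lines; the class name and record format are assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side ("replicated") join: File2 is shipped to every node via
// DistributedCache and loaded into memory once per task in setup().
public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> file2 = new HashMap<String, String>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Assumes the driver called DistributedCache.addCacheFile(file2Uri, conf).
    Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) {
      String[] kv = line.split("\t", 2);
      file2.put(kv[0], kv[1]);
    }
    in.close();
  }

  @Override
  protected void map(LongWritable off, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] kv = line.toString().split("\t", 2);
    String match = file2.get(kv[0]); // O(1) in-memory lookup, no second job
    if (match != null) {
      ctx.write(new Text(kv[0]), new Text(kv[1] + "," + match));
    }
  }
}

The driver would call DistributedCache.addCacheFile(...) with the HDFS URI of File2 before submitting the job.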

Regards

Jagat

----- Original Message -----
From: Stuti Awasthi
Sent: 04/04/12 07:55 AM
To: mapreduce-user@hadoop.apache.org
Subject: RE: Calling one MR job within another MR job

Hi Ravi,

There is no job dependency, so I cannot use MR chaining or JobControl as you suggested.

I have 2 relatively big files. I start processing with File1 as input to the MR1 job, and this processing needs to find data from File2. One way is to loop through File2 and get the data; the other way is to pass File2 to an MR2 job for parallel processing.

The second option hints at calling an MR2 job from inside the MR1 job. I am sure this is a common problem that people face. What is the best way to resolve this kind of issue?

Thanks

From: Ravi teja ch n v [mailto:raviteja.c...@huawei.com]
Sent: Wednesday, April 04, 2012 4:35 PM
To: mapreduce-user@hadoop.apache.org
Subject: RE: Calling one MR job within another MR job

Hi Stuti,

If you are looking for MRjob2 to run after MRjob1, i.e. a job dependency, you can use the JobControl API, where you can manage the dependencies.

Calling another Job from a Mapper is not a good idea.
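For illustration, a minimal sketch of that JobControl route, assuming the new-API ControlledJob class; the mapper/reducer classes and I/O paths for each job are omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// MRjob2 starts only after MRjob1 finishes, with no job launched
// from inside a mapper.
public class ChainDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job1 = new Job(conf, "MRjob1"); // configure mapper/reducer/paths here
    Job job2 = new Job(conf, "MRjob2");

    ControlledJob cj1 = new ControlledJob(job1, null);
    ControlledJob cj2 = new ControlledJob(job2, null);
    cj2.addDependingJob(cj1); // job2 waits for job1 to succeed

    JobControl jc = new JobControl("chain");
    jc.addJob(cj1);
    jc.addJob(cj2);

    new Thread(jc).start(); // JobControl implements Runnable
    while (!jc.allFinished()) {
      Thread.sleep(1000);
    }
    jc.stop();
  }
}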

Thanks,
Ravi Teja

________________________________
From: Stuti Awasthi [stutiawas...@hcl.com]
Sent: 04 April 2012 16:04:19
To: mapreduce-user@hadoop.apache.org
Subject: Calling one MR job within another MR job

Hi all,

We have a use case in which I start with a first MR1 job whose input file is File1.txt, and from within this job I call another MR2 job with File2.txt as input.

So:

MRjob1 {
    Map() {
        MRJob2(File2.txt)
    }
}

MRJob2 {
    Processing....
}

My queries are: is this kind of approach possible, and what are the implications from a performance perspective?

Regards,
Stuti Awasthi
HCL Comnet Systems and Services Ltd
F-8/9 Basement, Sec-3, Noida.