Dear Stuti,

As per the mail chain I understand you want to do a set join on two sets, File1 and File2, with some join function F(F1, F2). On this assumption, please find my reply below:
A set join is not simple, especially when the input is very large. It essentially takes the cartesian product of the two sets F1 and F2 and then filters out the required pairs based on the function F(F1, F2). What I mean is: say you have two files, each with 10 lakh (10^6) lines; a set join then has to consider 10^6 x 10^6 = 10^12 candidate pairs, and the filter phase works on all of those pairs to keep only the required ones. Since the work grows quadratically with the input size, it really helps to understand how the set-join function itself works. I have to admit that these kinds of problems are still under active research; please refer to the links below for more detail:

1. http://www.youtube.com/watch?v=kiuUGXWRzPA - Google Tech Talks
2. http://www.slideshare.net/rvernica/efficient-parallel-setsimilarity-joins-using-mapreduce-sigmod-2010-slides
3. http://research.microsoft.com/apps/pubs/default.aspx?id=76165
4. http://www.slideshare.net/ydn/4-similarity-joinshadoopsummit2010

@Distributed cache: it is not great if you have huge files. By default the distributed cache has a size limit of 10 GB.

@Launching jobs inside a mapper: not a great idea, because you would launch a job for every key/value pair and so end up launching a very large number of jobs. That is an absolute no-no: a bug in production can bring down the cluster, and it is also difficult to track all these jobs.
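If File2 is small enough to fit in memory on each node, the replicated join that Jagat describes below (cache File2 on the nodes and stream File1 through the mappers) is usually the simplest way out, and it needs no second job at all. A minimal sketch of such a mapper, assuming tab-separated "key<TAB>value" records and a join function F that is plain key equality (class and field names are only illustrative, not a fixed recipe):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  // In-memory copy of File2, keyed by the join key (assumes File2 fits in RAM).
  private final Map<String, String> file2 = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // File2 was added in the driver via DistributedCache.addCacheFile(...).
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cached == null || cached.length == 0) {
      return; // nothing cached
    }
    BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        // Assumes "key<TAB>value" records in File2.
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          file2.put(parts[0], parts[1]);
        }
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumes the same "key<TAB>value" layout for File1.
    String[] parts = value.toString().split("\t", 2);
    if (parts.length != 2) {
      return;
    }
    String matched = file2.get(parts[0]);
    if (matched != null) {
      // Emit the joined record only when F(F1, F2) holds; here F is key equality.
      context.write(new Text(parts[0]), new Text(parts[1] + "\t" + matched));
    }
  }
}

The driver would push File2 into the cache with DistributedCache.addCacheFile(...) before submitting the job. If File2 does not fit in memory, the usual fallback is a reduce-side join keyed on the join attribute, still within a single job.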
Thanks,
Praveen

On Wed, Apr 4, 2012 at 6:17 PM, <jagatsi...@gmail.com> wrote:
> Hello Stuti
>
> The way you have explained it, it seems we can think about caching File2
> on the nodes already.
>
> -- Just out of context: replicated joins are handled the same way in Pig,
> where one file (File2) to be joined is cached in memory while File1 is
> streamed through.
>
> Regards
> Jagat
>
> ----- Original Message -----
> > From: Stuti Awasthi
> > Sent: 04/04/12 07:55 AM
> > To: mapreduce-user@hadoop.apache.org
> > Subject: RE: Calling one MR job within another MR job
> >
> > Hi Ravi,
> >
> > There is no job dependency, so I cannot use chained MR or JobControl as
> > you suggested.
> >
> > I have 2 relatively big files. I start processing with File1 as input to
> > the MR1 job; this processing then requires finding data from File2. One
> > way is to loop through File2 and get the data; the other way is to pass
> > File2 to an MR2 job for parallel processing.
> >
> > The second option is hinting me to call an MR2 job from inside the MR1
> > job. I am sure this is a common problem that people usually face. What is
> > the best way to resolve this kind of issue?
> >
> > Thanks
> >
> > From: Ravi teja ch n v [mailto:raviteja.c...@huawei.com]
> > Sent: Wednesday, April 04, 2012 4:35 PM
> > To: mapreduce-user@hadoop.apache.org
> > Subject: RE: Calling one MR job within another MR job
> >
> > Hi Stuti,
> >
> > If you are looking for MRjob2 to run after MRjob1, i.e. a job dependency,
> > you can use the JobControl API, where you can manage the dependencies.
> >
> > Calling another Job from a Mapper is not a good idea.
> >
> > Thanks,
> > Ravi Teja
> >
> > ------------------------------
> > From: Stuti Awasthi [stutiawas...@hcl.com]
> > Sent: 04 April 2012 16:04:19
> > To: mapreduce-user@hadoop.apache.org
> > Subject: Calling one MR job within another MR job
> >
> > Hi all,
> >
> > We have a use case in which I start with a first MR1 job with input file
> > File1.txt, and from this job call another MR2 job with input File2.txt.
> >
> > So:
> >
> > MRjob1 {
> >   Map() {
> >     MRJob2(File2.txt)
> >   }
> > }
> >
> > MRJob2 {
> >   Processing....
> > }
> >
> > My queries are: is this kind of approach possible, and what are the
> > implications from the performance perspective?
> >
> > Regards,
> > Stuti Awasthi
> > HCL Comnet Systems and Services Ltd
> > F-8/9 Basement, Sec-3, Noida.
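PS: if it turns out that MRJob2 only needs to run after MRJob1 (rather than once per key/value), the JobControl route Ravi suggested above needs no nesting at all. A minimal driver-side sketch, assuming a Hadoop release that ships the new-API org.apache.hadoop.mapreduce.lib.jobcontrol classes; the job names are placeholders and the usual mapper/reducer/path configuration is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class JoinDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // MRJob1 and MRJob2 are configured as usual (mapper, reducer,
    // input/output paths are omitted to keep the sketch short).
    Job mrJob1 = Job.getInstance(conf, "MRJob1");
    Job mrJob2 = Job.getInstance(conf, "MRJob2");

    ControlledJob cJob1 = new ControlledJob(conf);
    cJob1.setJob(mrJob1);
    ControlledJob cJob2 = new ControlledJob(conf);
    cJob2.setJob(mrJob2);

    // MRJob2 starts only after MRJob1 completes successfully.
    cJob2.addDependingJob(cJob1);

    JobControl control = new JobControl("mrjob1-then-mrjob2");
    control.addJob(cJob1);
    control.addJob(cJob2);

    // JobControl.run() blocks, so drive it from its own thread and poll.
    Thread controlThread = new Thread(control);
    controlThread.setDaemon(true);
    controlThread.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
  }
}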