Dear Stuti,

As per the mail chain, I understand you want to do a set join on two sets,
File1 and File2, with some join function F(F1, F2). On this assumption, please
find my reply below:

A set join is not simple, especially when the input is very large. It
essentially takes the Cartesian product of the two sets F1 and F2 and then
filters out the required pairs based on some function F(F1, F2).

What I mean is: say you have two files, each with 10 lakh (one million) lines.
To perform a set join you essentially do 10 lakh x 10 lakh = 10^12 pair
comparisons, and the filter phase has to work over all of those pairs to pick
out the required ones.

Since the work grows quadratically with the input size, it is helpful to
understand exactly how the set-join function works; that insight is what makes
a smarter implementation possible.
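
To make this concrete, here is a minimal sketch of the naive approach as a
single reduce-side cross join. The class names and the joinMatches() predicate
(standing in for F(F1, F2)) are my own illustrative assumptions, not code from
the references below; it simply tags each record with its source file and then
pairs every File1 record with every File2 record in the reducer.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class NaiveSetJoin {

  // Tags every line with the name of the file it came from, and routes
  // everything to a single key so one reducer sees both sets.
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text ALL = new Text("all");

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String file = ((FileSplit) context.getInputSplit()).getPath().getName();
      context.write(ALL, new Text(file + "\t" + value.toString()));
    }
  }

  public static class CrossJoinReducer extends Reducer<Text, Text, Text, Text> {
    // Placeholder join predicate standing in for F(F1, F2): here,
    // "records match when their first comma-separated column is equal".
    private boolean joinMatches(String r1, String r2) {
      return r1.split(",")[0].equals(r2.split(",")[0]);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> file1 = new ArrayList<String>();
      List<String> file2 = new ArrayList<String>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if (parts[0].startsWith("File1")) {
          file1.add(parts[1]);
        } else {
          file2.add(parts[1]);
        }
      }
      // The quadratic blow-up described above: |File1| x |File2| evaluations.
      for (String r1 : file1) {
        for (String r2 : file2) {
          if (joinMatches(r1, r2)) {
            context.write(new Text(r1), new Text(r2));
          }
        }
      }
    }
  }
}

Note that this buffers both sides in memory and funnels everything through one
reducer, which is exactly why the similarity-join techniques in the links below
exist.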

Though I have to admit that these kinds of problems are still under active
research; please refer to the links below for more detail:


    1. http://www.youtube.com/watch?v=kiuUGXWRzPA - Google Tech Talks
    2. http://www.slideshare.net/rvernica/efficient-parallel-setsimilarity-joins-using-mapreduce-sigmod-2010-slides
    3. http://research.microsoft.com/apps/pubs/default.aspx?id=76165
    4. http://www.slideshare.net/ydn/4-similarity-joinshadoopsummit2010

@Distributed cache: It is not a great fit if you have huge files. By default
each node's local cache is limited to 10 GB (the local.cache.size setting), so
it is meant for files that are small relative to your inputs.
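
For completeness, here is a rough sketch of the map-side alternative when File2
is small enough to fit in memory (the replicated-join idea Jagat mentions
below): ship File2 to every node through the distributed cache and probe it
from the mapper. The HDFS path and the "join on the first comma-separated
column" logic are illustrative assumptions only.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoin {

  // Driver side: make File2 available locally on every task node.
  public static void addFile2ToCache(Job job) throws Exception {
    DistributedCache.addCacheFile(new URI("/data/File2.txt"), job.getConfiguration());
  }

  public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> file2ByKey = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
      // Load the cached copy of File2 once per mapper.
      Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
      BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = reader.readLine()) != null) {
        // Assumption: the join key is the first comma-separated column.
        file2ByKey.put(line.split(",")[0], line);
      }
      reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String joinKey = value.toString().split(",")[0];
      String match = file2ByKey.get(joinKey);
      if (match != null) {
        // Emit the joined pair; File1 records with no match are simply dropped.
        context.write(value, new Text(match));
      }
    }
  }
}

This only works when File2 stays within the per-node cache limit and when the
join is an equality join on some key; it does not help with an arbitrary
F(F1, F2).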

@Launching jobs inside a mapper: Not a great idea, because you would launch a
job for every key/value pair and so end up launching a very large number of
jobs. Absolutely not. A bug in production could bring down the cluster, and it
is also difficult to track all these jobs.
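
If the two jobs do need to be coordinated, the usual pattern is to submit them
both from the driver (for example with the JobControl API that Ravi mentions
below) instead of from inside a map() call. A rough sketch against the newer
org.apache.hadoop.mapreduce API, where createJob1()/createJob2() are
hypothetical helpers that would configure each job's mapper, reducer and paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class Driver {

  public static void main(String[] args) throws Exception {
    Job job1 = createJob1();   // reads File1.txt
    Job job2 = createJob2();   // reads File2.txt

    ControlledJob cj1 = new ControlledJob(job1, null);
    ControlledJob cj2 = new ControlledJob(job2, null);
    // If job2 had to consume job1's output you would declare the dependency:
    // cj2.addDependingJob(cj1);
    // With no dependency, the two jobs simply run side by side.

    JobControl control = new JobControl("file1-file2-processing");
    control.addJob(cj1);
    control.addJob(cj2);

    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);      // poll until both jobs have completed
    }
    control.stop();
  }

  // Hypothetical helpers: real code would set input/output formats and paths.
  private static Job createJob1() throws Exception {
    return Job.getInstance(new Configuration(), "process-File1");
  }

  private static Job createJob2() throws Exception {
    return Job.getInstance(new Configuration(), "process-File2");
  }
}

Either way, the number of jobs stays fixed and known up front, instead of one
job per key/value pair.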

Thanks,
Praveen
On Wed, Apr 4, 2012 at 6:17 PM, <jagatsi...@gmail.com> wrote:

> Hello Stuti
>
> The way you have explained it, it seems we can think about caching File2 on
> the nodes in advance.
>
> -- Just out of context: replicated joins are handled the same way in Pig,
> where one file (File2) is cached in memory and the other (File1) is streamed
> against it.
>
> Regards
>
> Jagat
>
>  ----- Original Message -----
>
> From: Stuti Awasthi
> Sent: 04/04/12 07:55 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: RE: Calling one MR job within another MR job
>
> Hi Ravi,
>
> There is no job dependency, so I cannot use MR chaining or JobControl as you
> suggested.
>
> I have two relatively big files. I start processing with File1 as input to
> the MR1 job; this processing then needs to find data from File2. One way is
> to loop through File2 and get the data; the other way is to pass File2 to an
> MR2 job for parallel processing.
>
> The second option is hinting me towards calling an MR2 job from inside the
> MR1 job. I am sure this is a common problem that people usually face. What is
> the best way to resolve this kind of issue?
>
> Thanks
>
> *From:* Ravi teja ch n v [mailto:raviteja.c...@huawei.com]
> *Sent:* Wednesday, April 04, 2012 4:35 PM
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* RE: Calling one MR job within another MR job
>
> Hi Stuti,
>
> If you are looking for MRjob2 to run after MRjob1, i.e. a job dependency, you
> can use the JobControl API, where you can manage the dependencies.
>
> Calling another Job from within a Mapper is not a good idea.
>
> Thanks,
>
> Ravi Teja
>
>  ------------------------------
>
> *From:* Stuti Awasthi [stutiawas...@hcl.com]
> *Sent:* 04 April 2012 16:04:19
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* Calling one MR job within another MR job
>
> Hi all,
>
> We have a use case in which I start a first MR1 job with File1.txt as its
> input, and from within this job I call another MR2 job with File2.txt as its
> input.
>
> So:
>
> MRjob1 {
>     Map() {
>         MRJob2(File2.txt)
>     }
> }
>
> MRJob2 {
>     Processing….
> }
>
> My questions are: is this kind of approach possible, and what are the
> implications from a performance perspective?
>
> Regards,
>
> *Stuti Awasthi*
> HCL Comnet Systems and Services Ltd
> F-8/9 Basement, Sec-3, Noida.
>
