Hi Rob,

When you give the Hadoop join example a directory name, it treats all the
files in that directory as a single table (kind of counterintuitive, but
very helpful if you work with large data sets). Try creating 3 separate
directories:

  tablea/a.txt
  tableb/b.txt
  tablec/c.txt

and run the query as:

  bin/hadoop jar hadoop-*-examples.jar join \
      -D key.value.separator.in.input.line=',' \
      -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
      -outKey org.apache.hadoop.io.Text \
      tablea tableb tablec theOutputs

Alex K

On Mon, Jan 25, 2010 at 6:25 PM, Rob Stewart <[email protected]> wrote:

> Good point, I missed that. It is:
>
>   bin/hadoop jar hadoop-*-examples.jar join \
>       -D key.value.separator.in.input.line=',' \
>       -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
>       -outKey org.apache.hadoop.io.Text \
>       join/ theOutputs
>
> Rob
>
> 2010/1/26 abhishek sharma <[email protected]>
>
> > What is the exact command that you are giving when submitting the
> > jobs? I did not see it in your e-mail.
> >
> > Abhishek
> >
> > On Mon, Jan 25, 2010 at 5:43 PM, Rob Stewart <[email protected]> wrote:
> > > Hi there, I'm using Hadoop 0.20.1 and I'm trying to use the Join
> > > application within the hadoop-*-examples.jar. I can't seem to figure
> > > out where I am going wrong. It isn't grouping the keys together, as
> > > I would expect...
> > > ------------------------
> > >> bin/hadoop dfs -cat join/a.txt
> > > AAAAAAAA,a0
> > > BBBBBBBB,a1
> > > CCCCCCCC,a2
> > > CCCCCCCC,a3
> > >
> > >> bin/hadoop dfs -cat join/b.txt
> > > AAAAAAAA,b0
> > > BBBBBBBB,b1
> > > BBBBBBBB,b2
> > > BBBBBBBB,b3
> > >
> > >> bin/hadoop dfs -cat join/c.txt
> > > AAAAAAAA,c0
> > > BBBBBBBB,c1
> > > DDDDDDDD,c2
> > > DDDDDDDD,c3
> > >
> > > -----*RESULT*-----
> > >> bin/hadoop dfs -text theOutputs/part-00000
> > > AAAAAAAA  [a0]
> > > AAAAAAAA  [b0]
> > > AAAAAAAA  [c0]
> > > BBBBBBBB  [c1]
> > > BBBBBBBB  [a1]
> > > BBBBBBBB  [b1]
> > > BBBBBBBB  [b2]
> > > BBBBBBBB  [b3]
> > > CCCCCCCC  [a2]
> > > CCCCCCCC  [a3]
> > > DDDDDDDD  [c2]
> > > DDDDDDDD  [c3]
> > > -----------------------
> > >
> > > So, why has it not grouped all the AAAAAAAA's etc., so that it
> > > instead looks like this:
> > >
> > > AAAAAAAA  [a0,b0,c0]
> > > BBBBBBBB  [a1,b1,c1]
> > > BBBBBBBB  [a1,b2,c1]
> > > BBBBBBBB  [a1,b3,c1]
> > > CCCCCCCC  [a2,,]
> > > CCCCCCCC  [a3,,]
> > > DDDDDDDD  [,,c2]
> > > DDDDDDDD  [,,c3]
> > >
> > > ?
> > >
> > > ---------------------
> > >
> > > I have another question. Instead of these key/value pairs, what if I
> > > have two input files, list1.txt and list2.txt, both containing a list
> > > of names, one per line? I want to JOIN these input files BY the
> > > names in each list, i.e. I want to create an output file containing a
> > > list of the names that appear in both input lists. Is it possible
> > > to adapt the Join example packaged with Hadoop to implement this?
> > >
> > > Many thanks,
> > >
> > > Rob Stewart
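[Editor's note: the grouping Rob expects is an outer join: for each key, the
values from the three tables are combined as a cross product, with an empty
slot wherever a table has no row for that key. This is a minimal plain-Python
sketch (not Hadoop code) of that grouping, using the toy data from the thread;
the function name `outer_join` is made up for illustration.]

```python
from collections import defaultdict
from itertools import product

# Toy stand-ins for the three inputs; in the real job each table would be
# a separate HDFS directory (tablea/, tableb/, tablec/) as Alex suggests.
raw = {
    "tablea": [("AAAAAAAA", "a0"), ("BBBBBBBB", "a1"),
               ("CCCCCCCC", "a2"), ("CCCCCCCC", "a3")],
    "tableb": [("AAAAAAAA", "b0"), ("BBBBBBBB", "b1"),
               ("BBBBBBBB", "b2"), ("BBBBBBBB", "b3")],
    "tablec": [("AAAAAAAA", "c0"), ("BBBBBBBB", "c1"),
               ("DDDDDDDD", "c2"), ("DDDDDDDD", "c3")],
}

def outer_join(tables):
    # Group each table's values by key, and collect the union of all keys.
    grouped, keys = [], set()
    for rows in tables:
        g = defaultdict(list)
        for k, v in rows:
            g[k].append(v)
            keys.add(k)
        grouped.append(g)
    # For every key, emit the cross product of the per-table value lists,
    # padding tables that lack the key with a single empty string.
    result = []
    for k in sorted(keys):
        slots = [g.get(k, [""]) for g in grouped]
        for combo in product(*slots):
            result.append((k, list(combo)))
    return result

for key, values in outer_join([raw["tablea"], raw["tableb"], raw["tablec"]]):
    print(f"{key}\t[{','.join(values)}]")
# First line printed: AAAAAAAA	[a0,b0,c0]
```

Run against the thread's sample data, this produces exactly the eight rows Rob
listed as his expected result, including the three BBBBBBBB rows (one per value
of b) and the padded CCCCCCCC/DDDDDDDD rows.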

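[Editor's note: Rob's second question — joining two name lists so that only
names appearing in both survive — is logically an inner join on the name
itself, i.e. a set intersection. A toy sketch of the idea outside Hadoop; the
names and the list1.txt/list2.txt contents below are made up.]

```python
# Hypothetical contents of list1.txt and list2.txt (one name per line).
list1 = ["alice", "bob", "carol"]
list2 = ["bob", "dave", "alice"]

# Joining the files "by name" keeps only the keys present in both inputs:
# an inner join where the name is the key and there is no value.
common = sorted(set(list1) & set(list2))
print(common)  # ['alice', 'bob']
```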