Hi there, I'm using Hadoop 0.20.1 and I'm trying to use the Join application
in hadoop-*examples.jar. I can't figure out where I'm going wrong: it isn't
grouping the keys together as I would expect.
------------------------
> bin/hadoop dfs -cat join/a.txt
AAAAAAAA,a0
BBBBBBBB,a1
CCCCCCCC,a2
CCCCCCCC,a3

> bin/hadoop dfs -cat join/b.txt
AAAAAAAA,b0
BBBBBBBB,b1
BBBBBBBB,b2
BBBBBBBB,b3

> bin/hadoop dfs -cat join/c.txt
AAAAAAAA,c0
BBBBBBBB,c1
DDDDDDDD,c2
DDDDDDDD,c3

-----*RESULT*-----
> bin/hadoop dfs -text theOutputs/part-00000
AAAAAAAA        [a0]
AAAAAAAA        [b0]
AAAAAAAA        [c0]
BBBBBBBB        [c1]
BBBBBBBB        [a1]
BBBBBBBB        [b1]
BBBBBBBB        [b2]
BBBBBBBB        [b3]
CCCCCCCC        [a2]
CCCCCCCC        [a3]
DDDDDDDD        [c2]
DDDDDDDD        [c3]
-----------------------


So why has it not grouped all the AAAAAAAA's, etc., so that the output
instead looks like this:

AAAAAAAA        [a0,b0,c0]
BBBBBBBB        [a1,b1,c1]
BBBBBBBB        [a1,b2,c1]
BBBBBBBB        [a1,b3,c1]
CCCCCCCC        [a2,,]
CCCCCCCC        [a3,,]
DDDDDDDD        [,,c2]
DDDDDDDD        [,,c3]

?
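
To make the expected result concrete, the outer-join semantics described
above can be sketched in plain Python. This is only an illustration of the
grouping I expect, not Hadoop code and not how the Join example is
implemented; the names outer_join and parse are made up for this sketch.

```python
from collections import defaultdict
from itertools import product


def outer_join(*sources):
    """Outer-join several lists of (key, value) pairs by key.

    Returns a key-sorted list of (key, tuple) rows. A source that has no
    value for a key contributes an empty string in its slot.
    """
    # grouped[key][i] holds the values that source i has for that key.
    grouped = defaultdict(lambda: [[] for _ in sources])
    for i, source in enumerate(sources):
        for key, value in source:
            grouped[key][i].append(value)

    rows = []
    for key in sorted(grouped):
        # Substitute a single empty value for any source missing this key,
        # then emit every combination across the sources.
        slots = [values or [""] for values in grouped[key]]
        for combo in product(*slots):
            rows.append((key, combo))
    return rows


def parse(text):
    """Split whitespace-separated 'key,value' records into pairs."""
    return [tuple(record.split(",", 1)) for record in text.split()]


# Same records as join/a.txt, join/b.txt and join/c.txt above.
a = parse("AAAAAAAA,a0 BBBBBBBB,a1 CCCCCCCC,a2 CCCCCCCC,a3")
b = parse("AAAAAAAA,b0 BBBBBBBB,b1 BBBBBBBB,b2 BBBBBBBB,b3")
c = parse("AAAAAAAA,c0 BBBBBBBB,c1 DDDDDDDD,c2 DDDDDDDD,c3")

for key, combo in outer_join(a, b, c):
    print(f"{key}\t[{','.join(combo)}]")
```

Run on the three inputs, this prints the eight grouped rows listed above,
from [a0,b0,c0] through [,,c3].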

---------------------

I have another question. Instead of these key/value pairs, suppose I
have two input files, list1.txt and list2.txt, each containing a list
of names, one name per line. I want to join these input files by the
names in each list, i.e. I want to create an output file containing
the names that appear in both input lists. Is it possible to adapt the
Join example packaged with Hadoop to implement this?
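
One way to read this second case is as an inner join in which the name is
the key and there is no value. A minimal plain-Python sketch of the desired
result follows; it is not Hadoop code, the function name common_names is
hypothetical, and the sample names merely stand in for the contents of
list1.txt and list2.txt.

```python
def common_names(lines1, lines2):
    """Return the names present in both inputs, sorted, one entry per name."""
    # Strip newlines and surrounding spaces, and ignore blank lines.
    names1 = {line.strip() for line in lines1 if line.strip()}
    names2 = {line.strip() for line in lines2 if line.strip()}
    return sorted(names1 & names2)


# Made-up sample data standing in for list1.txt and list2.txt.
list1 = ["alice", "bob", "carol"]
list2 = ["bob", "dave", "alice"]

for name in common_names(list1, list2):
    print(name)
```

With real files, the same function could be called on two open file
objects, since iterating over a file yields its lines.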


Many thanks,

Rob Stewart