Hi there,

I'm using Hadoop 0.20.1 and I'm trying to use the Join application within the hadoop-*examples.jar. I can't figure out where I am going wrong: it isn't grouping the keys together as I would expect.

------------------------
> bin/hadoop dfs -cat join/a.txt
AAAAAAAA,a0
BBBBBBBB,a1
CCCCCCCC,a2
CCCCCCCC,a3

> bin/hadoop dfs -cat join/b.txt
AAAAAAAA,b0
BBBBBBBB,b1
BBBBBBBB,b2
BBBBBBBB,b3

> bin/hadoop dfs -cat join/c.txt
AAAAAAAA,c0
BBBBBBBB,c1
DDDDDDDD,c2
DDDDDDDD,c3

-----*RESULT*-----
> bin/hadoop dfs -text theOutputs/part-00000
AAAAAAAA [a0]
AAAAAAAA [b0]
AAAAAAAA [c0]
BBBBBBBB [c1]
BBBBBBBB [a1]
BBBBBBBB [b1]
BBBBBBBB [b2]
BBBBBBBB [b3]
CCCCCCCC [a2]
CCCCCCCC [a3]
DDDDDDDD [c2]
DDDDDDDD [c3]
-----------------------

So why has it not grouped all the AAAAAAAA's etc., so that the output instead looks like this?

AAAAAAAA [a0,b0,c0]
BBBBBBBB [a1,b1,c1]
BBBBBBBB [a1,b2,c1]
BBBBBBBB [a1,b3,c1]
CCCCCCCC [a2,,]
CCCCCCCC [a3,,]
DDDDDDDD [,,c2]
DDDDDDDD [,,c3]

---------------------

I have another question. Instead of these key/value pairs, what if I have two input files, list1.txt and list2.txt, each containing a list of names, one name per line? I want to JOIN these input files BY the names in each list, i.e. I want to create an output file containing the names that appear in both input lists. Is it possible to adapt the Join example packaged with Hadoop to implement this?

Many thanks,
Rob Stewart
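
P.S. To pin down exactly what I am expecting, here is a minimal sketch in plain Java. It does not use the Hadoop join API at all. The class and method names (OuterJoinSketch, outerJoin, commonNames) are made up for illustration, and the names passed to commonNames are placeholder data, not my real lists. outerJoin groups the values of each source by key and emits one tuple per combination, leaving an empty slot where a source has no value for the key. commonNames is the "names in both lists" case from my second question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class OuterJoinSketch {

    // Each source is a list of "key,value" lines. For every key, emit the
    // cross product of the per-source value lists; a source that lacks the
    // key contributes a single empty slot.
    static List<String> outerJoin(List<List<String>> sources) {
        // key -> one value list per source, iterated in sorted key order
        Map<String, List<List<String>>> grouped = new TreeMap<>();
        for (int s = 0; s < sources.size(); s++) {
            for (String line : sources.get(s)) {
                String[] kv = line.split(",", 2);
                List<List<String>> slots = grouped.get(kv[0]);
                if (slots == null) {
                    slots = new ArrayList<>();
                    for (int i = 0; i < sources.size(); i++) {
                        slots.add(new ArrayList<String>());
                    }
                    grouped.put(kv[0], slots);
                }
                slots.get(s).add(kv[1]);
            }
        }

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<List<String>>> e : grouped.entrySet()) {
            List<String> tuples = new ArrayList<>();
            tuples.add("");
            boolean first = true;
            for (List<String> slot : e.getValue()) {
                List<String> values =
                        slot.isEmpty() ? Collections.singletonList("") : slot;
                List<String> next = new ArrayList<>();
                for (String prefix : tuples) {
                    for (String v : values) {
                        next.add(first ? v : prefix + "," + v);
                    }
                }
                tuples = next;
                first = false;
            }
            for (String t : tuples) {
                out.add(e.getKey() + " [" + t + "]");
            }
        }
        return out;
    }

    // Names that appear in both lists, sorted and without duplicates.
    static Set<String> commonNames(List<String> list1, List<String> list2) {
        Set<String> common = new TreeSet<>(list1);
        common.retainAll(new TreeSet<>(list2));
        return common;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("AAAAAAAA,a0", "BBBBBBBB,a1", "CCCCCCCC,a2", "CCCCCCCC,a3");
        List<String> b = Arrays.asList("AAAAAAAA,b0", "BBBBBBBB,b1", "BBBBBBBB,b2", "BBBBBBBB,b3");
        List<String> c = Arrays.asList("AAAAAAAA,c0", "BBBBBBBB,c1", "DDDDDDDD,c2", "DDDDDDDD,c3");
        for (String line : outerJoin(Arrays.asList(a, b, c))) {
            System.out.println(line);
        }
        // Placeholder names, purely for illustration.
        System.out.println(commonNames(
                Arrays.asList("alice", "bob", "carol"),
                Arrays.asList("bob", "carol", "dave")));
    }
}
```

Run on the three files above, outerJoin produces the eight lines I listed as my expected output. What I can't work out is how to get the packaged Join example to behave this way on the cluster.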
