Hi Rob,

When you give the Hadoop join example a directory name, it treats all the
files in that directory as a single table (kind of counterintuitive, but
very helpful if you work with large data sets). Try creating 3 separate
directories:

  tablea/a.txt
  tableb/b.txt
  tablec/c.txt

and run the query as:

  bin/hadoop jar hadoop-*-examples.jar join \
      -D key.value.separator.in.input.line=',' \
      -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
      -outKey org.apache.hadoop.io.Text \
      tablea tableb tablec theOutputs

Alex K

On Mon, Jan 25, 2010 at 6:25 PM, Rob Stewart <[email protected]> wrote:

> Good point, I missed that. It is:
>
>   bin/hadoop jar hadoop-*-examples.jar join \
>       -D key.value.separator.in.input.line=',' \
>       -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
>       -outKey org.apache.hadoop.io.Text \
>       join/ theOutputs
>
> Rob
>
> 2010/1/26 abhishek sharma <[email protected]>
>
> > What is the exact command that you are giving when submitting the
> > jobs? I did not see it in your e-mail.
> >
> > Abhishek
> >
> > On Mon, Jan 25, 2010 at 5:43 PM, Rob Stewart <[email protected]> wrote:
> > > Hi there, I'm using Hadoop 0.20.1 and I'm trying to use the Join
> > > application within the hadoop-*-examples.jar. I can't seem to figure
> > > out where I am going wrong. It isn't grouping the keys together, as
> > > I would expect...
> > > ------------------------
> > >> bin/hadoop dfs -cat join/a.txt
> > > AAAAAAAA,a0
> > > BBBBBBBB,a1
> > > CCCCCCCC,a2
> > > CCCCCCCC,a3
> > >
> > >> bin/hadoop dfs -cat join/b.txt
> > > AAAAAAAA,b0
> > > BBBBBBBB,b1
> > > BBBBBBBB,b2
> > > BBBBBBBB,b3
> > >
> > >> bin/hadoop dfs -cat join/c.txt
> > > AAAAAAAA,c0
> > > BBBBBBBB,c1
> > > DDDDDDDD,c2
> > > DDDDDDDD,c3
> > >
> > > -----*RESULT*-----
> > >> bin/hadoop dfs -text theOutputs/part-00000
> > > AAAAAAAA  [a0]
> > > AAAAAAAA  [b0]
> > > AAAAAAAA  [c0]
> > > BBBBBBBB  [c1]
> > > BBBBBBBB  [a1]
> > > BBBBBBBB  [b1]
> > > BBBBBBBB  [b2]
> > > BBBBBBBB  [b3]
> > > CCCCCCCC  [a2]
> > > CCCCCCCC  [a3]
> > > DDDDDDDD  [c2]
> > > DDDDDDDD  [c3]
> > > -----------------------
> > >
> > > So, why has it not grouped all the AAAAAAAA's etc., so that it
> > > instead looks like this:
> > >
> > > AAAAAAAA  [a0,b0,c0]
> > > BBBBBBBB  [a1,b1,c1]
> > > BBBBBBBB  [a1,b2,c1]
> > > BBBBBBBB  [a1,b3,c1]
> > > CCCCCCCC  [a2,,]
> > > CCCCCCCC  [a3,,]
> > > DDDDDDDD  [,,c2]
> > > DDDDDDDD  [,,c3]
> > >
> > > ?
> > >
> > > ---------------------
> > >
> > > I have another question. Instead of these key/value pairs, what if I
> > > have two input files, list1.txt and list2.txt, both containing a list
> > > of names, one per line? I want to JOIN these input files BY the
> > > names in each list, i.e. I want to create an output file containing a
> > > list of the names that appear in both input lists. Is it possible
> > > to adapt the Join example packaged with Hadoop to implement this?
> > >
> > > Many thanks,
> > >
> > > Rob Stewart
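[Editor's note: the grouping Rob expects is an outer join: for each key, the
values from the three tables are combined as a cross product, with an empty
slot wherever a table has no row for that key. This is a minimal plain-Python
sketch (not Hadoop code) of that grouping, using the toy data from the thread;
the function name `outer_join` is made up for illustration.]

```python
from collections import defaultdict
from itertools import product

# Toy stand-ins for the three inputs; in the real job each table would be
# a separate HDFS directory (tablea/, tableb/, tablec/) as Alex suggests.
raw = {
    "tablea": [("AAAAAAAA", "a0"), ("BBBBBBBB", "a1"),
               ("CCCCCCCC", "a2"), ("CCCCCCCC", "a3")],
    "tableb": [("AAAAAAAA", "b0"), ("BBBBBBBB", "b1"),
               ("BBBBBBBB", "b2"), ("BBBBBBBB", "b3")],
    "tablec": [("AAAAAAAA", "c0"), ("BBBBBBBB", "c1"),
               ("DDDDDDDD", "c2"), ("DDDDDDDD", "c3")],
}

def outer_join(tables):
    # Group each table's values by key, and collect the union of all keys.
    grouped, keys = [], set()
    for rows in tables:
        g = defaultdict(list)
        for k, v in rows:
            g[k].append(v)
            keys.add(k)
        grouped.append(g)
    # For every key, emit the cross product of the per-table value lists,
    # padding tables that lack the key with a single empty string.
    result = []
    for k in sorted(keys):
        slots = [g.get(k, [""]) for g in grouped]
        for combo in product(*slots):
            result.append((k, list(combo)))
    return result

for key, values in outer_join([raw["tablea"], raw["tableb"], raw["tablec"]]):
    print(f"{key}\t[{','.join(values)}]")
# First line printed: AAAAAAAA	[a0,b0,c0]
```

Run against the thread's sample data, this produces exactly the eight rows Rob
listed as his expected result, including the three BBBBBBBB rows (one per value
of b) and the padded CCCCCCCC/DDDDDDDD rows.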

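[Editor's note: Rob's second question — joining two name lists so that only
names appearing in both survive — is logically an inner join on the name
itself, i.e. a set intersection. A toy sketch of the idea outside Hadoop; the
names and the list1.txt/list2.txt contents below are made up.]

```python
# Hypothetical contents of list1.txt and list2.txt (one name per line).
list1 = ["alice", "bob", "carol"]
list2 = ["bob", "dave", "alice"]

# Joining the files "by name" keeps only the keys present in both inputs:
# an inner join where the name is the key and there is no value.
common = sorted(set(list1) & set(list2))
print(common)  # ['alice', 'bob']
```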