Aleksandar,
Thank you so much. Now I think I have all I need to give Hadoop a run in a cluster environment.
- Kevin Tse

On Mon, May 31, 2010 at 2:54 PM, Aleksandar Stupar <[email protected]> wrote:
> Hi guys,
>
> this looks to me like a set self-join problem. As I see it, the easiest way
> to implement it using MR would be to use data1.txt as the input
> (if you already have it, or generate it by MR):
>
> data1.txt:
> list1 111,222,333
> list2 111,222,333,444
> list3 111,888
>
> In the map function, output all pairs of files per folder:
>
> map(String folder, String[] files) {
>     for (String file1 : files) {
>         for (String file2 : files) {
>             if (!file1.equals(file2))
>                 output(file1, file2);
>         }
>     }
> }
>
> Now in the reduce function you have everything you need:
>
> reduce(String file, String[] otherFiles) {
>     HashMap<String, Long> mergedResults = new HashMap<String, Long>();
>     for (String otherFile : otherFiles) {
>         Long count = mergedResults.get(otherFile);
>         mergedResults.put(otherFile, count == null ? 1L : count + 1);
>     }
>     // emit top K results by count value
> }
>
> Hope this helps,
> Aleksandar.
>
>
> ________________________________
> From: Kevin Tse <[email protected]>
> To: [email protected]
> Sent: Sat, May 29, 2010 11:47:10 AM
> Subject: Re: Is Hadoop applicable to this problem.
>
> Hi, Eric
> Thank you for your reply.
> With your tip, I was able to write the Map function and Reduce function to
> generate the intermediate data and save it to two files, data1.txt and
> data2.txt, which I could also achieve easily using AWK as I mentioned in the
> previous mail. Generating the intermediate data in those formats is not
> really a tough task for the moment; the MR jobs that generate it are as
> simple as the word counting one. But when I think of writing another pair of
> MR functions to generate the final output (I will show the format of the
> final output below), I don't know how to start coding. Now the problem is
> how to write the Map function and Reduce function that generate the final
> result, with either the original data or the intermediate data as the input
> (I am not sure whether the intermediate data in those formats helps here).
>
> The MapReduce paradigm seems quite simple: the Map function just collects
> the input as key-value pairs, while the Reduce function merges the values
> with the same key; it seems to me that it can do nothing more than that.
> But my requirement is not that simple: it involves grouping and sorting. I
> hope I am misunderstanding something and you will point it out.
>
> Now I don't want to get too many things/tools involved to accomplish the
> task; as you suggested, I may code it myself using Hadoop common. If I am
> reading your words right, Hadoop alone (without Pig and Hive) will meet my
> requirement.
>
> The format of the final result I wish to get from the original data is like
> the following:
> 111 222,333,444,888 ... (if there are more than 20 items here, I just want
> the top 20)
> 222 111,333,444,888
> 333 111,222,444
> 444 111,222,333
> 888 111
>
> Your help will be greatly appreciated.
> - Kevin Tse
>
> On Sat, May 29, 2010 at 2:07 AM, Eric Sammer <[email protected]> wrote:
> > Kevin:
> >
> > This is certainly something Hadoop can do well. The easiest way to do
> > this is in multiple map reduce jobs.
> >
> > Job 1: Group all files by folder.
> >
> > You can simply use Hadoop's grouping by key.
> > In pseudo code, you would do:
> >
> > // key is the byte offset in the text file, value is the line of text
> > def map(key, value, collector):
> >     list parts = value.split("\s")
> >     collector.collect(parts[0], parts[1])
> >
> > def reduce(key, values, collector):
> >     // key is list1, values is all files for list1
> >     buffer = ""
> >     int counter = 0
> >     for (value in values):
> >         if (counter > 0):
> >             buffer = buffer + ","
> >         buffer = buffer + value
> >         counter = counter + 1
> >     collector.collect(key, buffer)
> >
> > This gets you your example data1.txt.
> >
> > For data2.txt, you would do another MR job over the original input
> > file and simply make parts[1] the key and parts[0] the value in the
> > map() method. You can then do what you need with the files from there.
> > Selecting the top N is really just another MR job. Tools like Pig and
> > Hive can do all of these operations for you, giving you higher-level
> > languages that save you some coding, but it's probably a nice project to
> > learn Hadoop by writing the code yourself. You should be able to start
> > with the word count code and modify it to do this. Take a look at the
> > Cloudera training videos to learn more.
> > http://www.cloudera.com/resources/?type=Training
> >
> > Hope this helps.
> >
> > On Fri, May 28, 2010 at 6:34 AM, Kevin Tse <[email protected]> wrote:
> > > Hi, all.
> > > I have encountered a problem that cannot be solved with simple
> > > computation, and I don't know whether Hadoop is applicable to it; I am
> > > completely new to Hadoop and MapReduce.
> > > I have the raw data stored in a txt file weighing 700MB in size
> > > (100 million lines). The file is in the following format:
> > > list1 111
> > > list1 222
> > > list1 333
> > > list2 111
> > > list2 222
> > > list2 333
> > > list2 444
> > > list3 111
> > > list3 888
> > >
> > > The first field of each line is something like a folder, the second one
> > > is like a file. A file (the same file) can be saved under an arbitrary
> > > number of different folders.
> > > From this raw data, for the file "111", I want to collect the files that
> > > are saved under the folders that contain the file "111" (the file "111"
> > > itself excluded), and extract the top 20 of these files, sorted by their
> > > appearance frequency in descending order.
> > >
> > > I was trying to solve this problem using AWK, but the script consumed
> > > too much memory, and it was not fast enough.
> > > Later I heard about Hadoop, and from some tutorials on the web I learned
> > > a little about how it works. I want to use its "Distributed Computing"
> > > ability.
> > > I have already read the word counting tutorial, but I still don't have
> > > an idea of how to write my own Map function and Reduce function for this
> > > problem.
> > >
> > > By the way, I can generate the intermediate data and save it to 2 files
> > > at acceptable speed using AWK in the following format:
> > > data1.txt:
> > > list1 111,222,333
> > > list2 111,222,333,444
> > > list3 111,888
> > >
> > > data2.txt:
> > > 111 list1,list2,list3
> > > 222 list1,list2
> > > 333 list1,list3
> > > 444 list2
> > > 888 list3
> > >
> > > My question is:
> > > Is Hadoop applicable to this problem? If so, would you please give me a
> > > clue on how to implement the Map function and the Reduce function.
> > > Thank you in advance.
> > >
> > > - Kevin Tse
> >
> > --
> > Eric Sammer
> > phone: +1-917-287-2675
> > twitter: esammer
> > data: www.cloudera.com
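
Eric's Job 1, written out against the newer org.apache.hadoop.mapreduce API rather than as pseudocode. This is only a sketch under a few assumptions: the class names (GroupByFolder, FolderMapper, ConcatReducer) are invented here, and the raw input lines are taken to look like "list1 111" with whitespace between the two fields. Swapping parts[0] and parts[1] in the mapper gives the data2.txt variant Eric mentions.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of "Job 1: group all files by folder", producing data1.txt-style lines.
public class GroupByFolder {

  // Input lines look like "list1 111"; emit (folder, file) so Hadoop groups by folder.
  public static class FolderMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().trim().split("\\s+");
      if (parts.length == 2) {
        context.write(new Text(parts[0]), new Text(parts[1]));
        // For data2.txt, emit the fields the other way around:
        // context.write(new Text(parts[1]), new Text(parts[0]));
      }
    }
  }

  // Concatenate every file seen for one folder into a comma-separated list.
  public static class ConcatReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringBuilder buffer = new StringBuilder();
      for (Text value : values) {
        if (buffer.length() > 0) {
          buffer.append(",");
        }
        buffer.append(value.toString());
      }
      context.write(key, new Text(buffer.toString()));
    }
  }
}

With the default TextOutputFormat the key and value are separated by a tab, so the output lines come out as "list1<tab>111,222,333", i.e. the data1.txt format.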

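A sketch along the same lines of Aleksandar's pair-emission idea for the second job: it reads data1.txt-style lines (folder, then a comma-separated file list) and writes, for each file, its top 20 co-occurring files in descending count order, which is the final result format Kevin describes. The class names, the K = 20 cut-off, and the driver are assumptions of this sketch rather than anything from the thread, and the per-key HashMap assumes no single file co-occurs with more distinct files than fit in a reducer's memory.

import java.io.IOException;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.PriorityQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CoOccurrenceTop20 {

  // For each "folder<ws>f1,f2,..." line, emit every ordered pair of distinct files.
  public static class PairMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().trim().split("\\s+");
      if (parts.length < 2) {
        return;
      }
      String[] files = parts[1].split(",");
      for (String file1 : files) {
        for (String file2 : files) {
          if (!file1.equals(file2)) {
            context.write(new Text(file1), new Text(file2));
          }
        }
      }
    }
  }

  // Count how often each other file shares a folder with the key file; keep the top 20.
  public static class TopKReducer extends Reducer<Text, Text, Text, Text> {
    private static final int K = 20;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Map<String, Long> counts = new HashMap<String, Long>();
      for (Text value : values) {
        String other = value.toString();
        Long count = counts.get(other);
        counts.put(other, count == null ? 1L : count + 1);
      }
      // Min-heap on count: the least frequent of the current top K sits at the head.
      PriorityQueue<Map.Entry<String, Long>> top = new PriorityQueue<Map.Entry<String, Long>>(
          K, new Comparator<Map.Entry<String, Long>>() {
            public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
              return a.getValue().compareTo(b.getValue());
            }
          });
      for (Map.Entry<String, Long> entry : counts.entrySet()) {
        top.offer(entry);
        if (top.size() > K) {
          top.poll();
        }
      }
      // Drain the heap so the most frequent file comes first, then join with commas.
      LinkedList<String> ordered = new LinkedList<String>();
      while (!top.isEmpty()) {
        ordered.addFirst(top.poll().getKey());
      }
      StringBuilder buffer = new StringBuilder();
      for (String file : ordered) {
        if (buffer.length() > 0) {
          buffer.append(",");
        }
        buffer.append(file);
      }
      context.write(key, new Text(buffer.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    // new Job(conf, name) is the 0.20-era constructor; later releases use Job.getInstance.
    Job job = new Job(new Configuration(), "co-occurrence top 20");
    job.setJarByClass(CoOccurrenceTop20.class);
    job.setMapperClass(PairMapper.class);
    job.setReducerClass(TopKReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run the grouping job first, then point this job's input path at that job's output directory.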