Hi all. I have run into a problem that I cannot solve with simple computation, and I don't know whether Hadoop is applicable to it; I am completely new to Hadoop and MapReduce. My raw data is stored in a 700MB text file (about 100 million lines) in the following format:

list1 111
list1 222
list1 333
list2 111
list2 222
list2 333
list2 444
list3 111
list3 888
The first field of each line is something like a folder, and the second is like a file; the same file can be saved under an arbitrary number of different folders. From this raw data, for the file "111" I want to collect the files that are saved under the folders containing "111" (excluding "111" itself), and extract the top 20 of those files sorted by their frequency of appearance in descending order. (With the sample data above, that would give 222 and 333 appearing twice each, then 444 and 888 appearing once each.) I tried to solve this with AWK, but the script consumed too much memory and was not fast enough.

Later I heard about Hadoop, and from some tutorials on the web I learned a little about how it works; I want to use its distributed computing ability. I have already read the word count tutorial, but I still have no idea how to write my own Map function and Reduce function for this problem.

By the way, using AWK I can generate intermediate data at an acceptable speed and save it to two files in the following format:

data1.txt:
list1 111,222,333
list2 111,222,333,444
list3 111,888

data2.txt:
111 list1,list2,list3
222 list1,list2
333 list1,list3
444 list2
888 list3

My question is: is Hadoop applicable to this problem? If so, would you please give me a clue on how to implement the Map function and the Reduce function?

Thank you in advance.
- Kevin Tse
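P.S. To make the question more concrete, here is my rough guess at a first job, written against the org.apache.hadoop.mapreduce Java API I saw in the word count tutorial. This is only a minimal sketch: the class and field names are my own invention, the job driver and the second job are omitted, and I have no idea whether this is the right approach. The idea is to group files by folder, and for every folder that contains "111", emit each of the other files with a count of 1; a second word-count-style job would then sum the counts and pick the top 20.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CoOccurrence111 {

    // Map: "list1 111" -> key = "list1" (the folder), value = "111" (the file)
    public static class FolderMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length == 2) {
                context.write(new Text(fields[0]), new Text(fields[1]));
            }
        }
    }

    // Reduce: all files of one folder arrive together. If the folder
    // contains "111", emit (otherFile, 1) for every other file in it.
    public static class CoOccurrenceReducer
            extends Reducer<Text, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void reduce(Text folder, Iterable<Text> files, Context context)
                throws IOException, InterruptedException {
            List<String> fileList = new ArrayList<String>();
            boolean contains111 = false;
            for (Text f : files) {
                String name = f.toString();
                fileList.add(name);
                if ("111".equals(name)) {
                    contains111 = true;
                }
            }
            if (contains111) {
                for (String name : fileList) {
                    if (!"111".equals(name)) {
                        context.write(new Text(name), ONE);
                    }
                }
            }
        }
    }
}

Does this look like the right direction, or is there a better way to structure it?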
