Hi, all.
I have encountered a problem that cannot be solved with a simple computation,
and I don't know whether Hadoop is applicable to it. I am completely new to
Hadoop and MapReduce.
I have the raw data stored in a txt file that is about 700MB in size (100
million lines). The file is in the following format:
list1 111
list1 222
list1 333
list2 111
list2 222
list2 333
list2 444
list3 111
list3 888

The first field of each line is something like a folder, and the second one is
like a file. The same file can be saved under an arbitrary number of
different folders.
From this raw data, for the file "111", I want to collect the files that are
saved under the folders containing the file "111" (excluding "111" itself),
and then extract the top 20 of these files, sorted by their appearance
frequency in descending order.

I was trying to solve this problem using AWK, but the script consumed too
much memory, and it was not fast enough.
Later I heard about Hadoop, and from some tutorials on the web I learned a
little about how it works. I want to use its distributed computing ability.
I have already read the word count tutorial, but I still have no idea how to
write my own Map function and Reduce function for this problem.

By the way, using AWK I can generate the intermediate data and save it to two
files at an acceptable speed, in the following format:
data1.txt:
list1 111,222,333
list2 111,222,333,444
list3 111,888

data2.txt:
111 list1,list2,list3
222 list1,list2
333 list1,list3
444 list2
888 list3

My question is:
Is Hadoop applicable to this problem? If so, would you please give me a clue
about how to implement the Map function and the Reduce function.
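To make it concrete, here is my rough guess at a Map/Reduce pair that would
read data1.txt and count, for "111", the other files that share a folder with
it. I am assuming the org.apache.hadoop.mapreduce API from the word count
tutorial; the class names and the TARGET constant are just placeholders I made
up, and I have no idea whether this is the right approach:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CoOccurrence {

  // Placeholder: the file whose neighbours I want to count.
  public static final String TARGET = "111";

  // Input line (from data1.txt): "list1 111,222,333"
  public static class CoMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().trim().split("\\s+");
      if (parts.length < 2) return;
      String[] files = parts[1].split(",");
      boolean hasTarget = false;
      for (String f : files) {
        if (f.equals(TARGET)) { hasTarget = true; break; }
      }
      if (!hasTarget) return;
      // The folder contains "111": emit every other file once.
      for (String f : files) {
        if (!f.equals(TARGET)) {
          outKey.set(f);
          context.write(outKey, ONE);
        }
      }
    }
  }

  // Sums the 1s per file, like the word count reducer.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
}

I guess the top-20-by-frequency part would then be a second job (or just a
sort of the reducer output, which should be small), but I am not sure how that
is normally done in Hadoop.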
Thank you in advance.

- Kevin Tse
