Hi, Eric
Thank you for your reply.
With your tip, I was able to write the Map and Reduce functions that
generate the intermediate data and save it to two files, data1.txt and
data2.txt, which is something I could already do easily with AWK, as I
mentioned in my previous mail. Generating the intermediate data in those
formats is not really the hard part; those MR jobs are as simple as the
word counting one. But when I think about writing the MR job that produces
the final output (I show its format below), I don't know how to start
coding. So the problem now is how to write the Map and Reduce functions
that produce the final result, taking either the original data or the
intermediate data as input (I am not sure whether the intermediate data in
those formats even helps here).
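
For reference, here is roughly what my current job for data1.txt looks
like (a trimmed, from-memory sketch using the org.apache.hadoop.mapreduce
API; the class names are just my own placeholders):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FolderGrouping {

  // Mapper: split each "folder<whitespace>file" line and emit (folder, file).
  public static class FolderMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\\s+");
      if (parts.length < 2) return; // skip malformed lines
      context.write(new Text(parts[0]), new Text(parts[1]));
    }
  }

  // Reducer: join all files of one folder into one comma-separated line,
  // which is exactly the data1.txt format.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringBuilder buffer = new StringBuilder();
      for (Text value : values) {
        if (buffer.length() > 0) buffer.append(",");
        buffer.append(value.toString());
      }
      context.write(key, new Text(buffer.toString()));
    }
  }
}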

The MapReduce paradigm seems quite simple: the Map function emits the
input as key-value pairs, and the Reduce function merges the values that
share the same key; it seems to me that it can do nothing more than that.
But my requirement is not that simple: it involves both grouping and
sorting. I hope I am misunderstanding something and that you can point it
out.

I don't want to get too many tools involved to accomplish the task, so I
will code it myself against Hadoop common, as you suggested. If I am
reading your words correctly, Hadoop alone (without Pig and Hive) should
meet my requirement.
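
If that is the case, I figure the driver is just a matter of chaining
plain Job objects, something like this (only a sketch of how I picture
it; job 2 is left as a TODO because that is exactly the part I am stuck
on):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FolderGroupingDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Job 1: raw "folder file" lines -> "folder \t file1,file2,..."
    // (the data1.txt format).
    Job groupJob = new Job(conf, "group files by folder");
    groupJob.setJarByClass(FolderGroupingDriver.class);
    groupJob.setMapperClass(FolderGrouping.FolderMapper.class);
    groupJob.setReducerClass(FolderGrouping.JoinReducer.class);
    groupJob.setOutputKeyClass(Text.class);
    groupJob.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(groupJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(groupJob, new Path(args[1]));
    if (!groupJob.waitForCompletion(true)) {
      System.exit(1);
    }

    // TODO Job 2: read job 1's output and produce the final top-20 lists.
    // This is the part I don't know how to write yet.
  }
}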

The format of the final result I wish to get from the original data is
like the following (if a line has more than 20 items, I only want the top
20):
111 222,333,444,888 ...
222 111,333,444
333 111,222,444
444 111,222,333
888 111

Your help will be greatly appreciated.
- Kevin Tse

On Sat, May 29, 2010 at 2:07 AM, Eric Sammer <[email protected]> wrote:

> Kevin:
>
> This is certainly something Hadoop can do well. The easiest way to do
> this is in multiple map reduce jobs.
>
> Job 1: Group all files by folder.
>
> You can simply use Hadoop's grouping by key. In pseudo code, you would do:
>
> // key is byte offset in the text file, value is the line of text
> def map(key, value, collector):
>  list parts = value.split("\s")
>
>  collector.collect(parts[0], parts[1])
>
> def reduce(key, values, collector):
>  // key is list1, values is all files for list1
>  buffer = ""
>
>  int counter = 0
>
>  for (value in values):
>    if (counter > 0):
>      buffer = buffer + ","
>
>    buffer = buffer + value
>    counter = counter + 1
>
>  collector.collect(key, buffer)
>
> This gets you your example data1.txt.
>
> For data2.txt, you would do another MR job over the original input
> file and simply make part[1] the key and part[0] the value in the
> map() method. You can then do what you need with the files from there.
> Selecting the top N is really just another MR job. Tools like Pig and
> Hive can do all of these operations for you, giving you higher level
> languages and saving you some coding, but it's probably a nice project
> to learn Hadoop by writing the code yourself. You should be able to start
> with the word count code and modify it to do this. Take a look at the
> Cloudera training videos to learn more.
> http://www.cloudera.com/resources/?type=Training
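>
> Very roughly, the pair/top-20 job over the data1.txt-style output could
> look like this (an untested sketch against the 0.20 mapreduce API; it
> counts co-files in memory per key, and the pair fan-out can get large
> for big folders, but it shows the idea):
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.Comparator;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.Reducer;
>
> public class TopCoFiles {
>
>   // Mapper over job 1's output ("folder<TAB>file1,file2,..."):
>   // emit every ordered pair (fileA, fileB) with fileA != fileB.
>   public static class PairMapper
>       extends Mapper<LongWritable, Text, Text, Text> {
>     public void map(LongWritable key, Text value, Context context)
>         throws IOException, InterruptedException {
>       String[] kv = value.toString().split("\t");
>       if (kv.length < 2) return;
>       String[] files = kv[1].split(",");
>       for (String a : files) {
>         for (String b : files) {
>           if (!a.equals(b)) {
>             context.write(new Text(a), new Text(b));
>           }
>         }
>       }
>     }
>   }
>
>   // Reducer: count how often each co-file shares a folder with the key
>   // file, then keep the 20 most frequent.
>   public static class Top20Reducer extends Reducer<Text, Text, Text, Text> {
>     public void reduce(Text key, Iterable<Text> values, Context context)
>         throws IOException, InterruptedException {
>       Map<String, Integer> counts = new HashMap<String, Integer>();
>       for (Text value : values) {
>         Integer c = counts.get(value.toString());
>         counts.put(value.toString(), c == null ? 1 : c + 1);
>       }
>       List<Map.Entry<String, Integer>> entries =
>           new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
>       Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
>         public int compare(Map.Entry<String, Integer> a,
>                            Map.Entry<String, Integer> b) {
>           return b.getValue().compareTo(a.getValue()); // descending
>         }
>       });
>       StringBuilder top = new StringBuilder();
>       for (int i = 0; i < Math.min(20, entries.size()); i++) {
>         if (i > 0) top.append(",");
>         top.append(entries.get(i).getKey());
>       }
>       context.write(key, new Text(top.toString()));
>     }
>   }
> }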
>
> Hope this helps.
>
> On Fri, May 28, 2010 at 6:34 AM, Kevin Tse <[email protected]>
> wrote:
> > Hi, all.
> > I have encountered a problem that cannot be solved with simple
> > computation, and I don't know whether Hadoop is applicable to it; I am
> > completely new to Hadoop and MapReduce.
> > I have the raw data stored in a txt file weighing 700MB in size (about
> > 100 million lines). The file is in the following format:
> > list1 111
> > list1 222
> > list1 333
> > list2 111
> > list2 222
> > list2 333
> > list2 444
> > list3 111
> > list3 888
> >
> > The first field of each line is something like a folder, and the second
> > one is like a file. A file (the same file) can be saved under an
> > arbitrary number of different folders.
> > From this raw data, for the file "111", I want to collect the files that
> > are saved under the folders that contain the file "111" (excluding "111"
> > itself), and extract the top 20 of these files, sorted by their
> > appearance frequency in descending order.
> >
> > I was trying to solve this problem using AWK, but the script consumed
> > too much memory and it was not fast enough.
> > Later I heard about Hadoop, and from some tutorials on the web I learned
> > a little about how it works. I want to use its "distributed computing"
> > ability.
> > I have already read the word counting tutorial, but I still have no idea
> > how to write my own Map function and Reduce function for this problem.
> >
> > By the way, I can generate the intermediate data and save it to two
> > files at an acceptable speed using AWK, in the following format:
> > data1.txt:
> > list1 111,222,333
> > list2 111,222,333,444
> > list3 111,888
> >
> > data2.txt:
> > 111 list1,list2,list3
> > 222 list1,list2
> > 333 list1,list2
> > 444 list2
> > 888 list3
> >
> > My question is: is Hadoop applicable to this problem? If so, would you
> > please give me a clue on how to implement the Map function and the
> > Reduce function?
> > Thank you in advance.
> >
> > - Kevin Tse
> >
>
>
>
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com
>
