Re: Modeling WordCount in a different way

2009-04-07 Thread Aayush Garg
I want to investigate whether Hadoop can handle streams, i.e. data arriving as
an infinite stream while Hadoop performs online aggregation.
Hadoop comes with fault tolerance and other nice features, so these would be
used directly in such a scenario.

On Tue, Apr 7, 2009 at 4:28 PM, Norbert Burger wrote:

> Aayush, out of curiosity, why do you want to model wordcount this way?
> What benefit do you see?
>
> Norbert
>
> On 4/6/09, Aayush Garg  wrote:
> > Hi,
> >
> >  I want to make experiments with wordcount example in a different way.
> >
> >  Suppose we have very large data. Instead of splitting all the data one
> time,
> >  we want to feed some splits in the map-reduce job at a time. I want to
> model
> >  the hadoop job like this,
> >
> >  Suppose a batch of inputsplits arrive in the beginning to every map, and
> >  reduce gives the word, frequency for this batch of inputsplits.
> >  Now after this another batch of inputsplits arrive and the results from
> >  subsequent reduce are aggregated to the previous results(if the word
> "that"
> >  has frequency 2 in previous processing and in this processing it occurs
> 1
> >  time, then the frequency of "that" is now maintained as 3).
> >  In next map-reduce "that" comes 4 times, now its frequency maintained as
> >  7
> >
> >  And this process goes on like this.
> >  Now how would I model inputsplits like this and how these continuous
> >  map-reduces can be made running. In what way should I keep the results
> of
> >  Map-Reduces so that I could aggregate this with the output of next
> >  Map-reduce.
> >
> >  Thanks,
> >
> > Aayush
> >
>



-- 
Aayush Garg


Re: Modeling WordCount in a different way

2009-04-07 Thread Aayush Garg
I am confused about how I would start the next job after the previous one
finishes; could you clarify it with a rough example? Also, do I need to use
SequenceFileInputFormat to keep the results in memory and then access them?

On Tue, Apr 7, 2009 at 10:43 AM, Sharad Agarwal wrote:

>
>
> > Suppose a batch of inputsplits arrive in the beginning to every map, and
> > reduce gives the word, frequency for this batch of inputsplits.
> > Now after this another batch of inputsplits arrive and the results from
> > subsequent reduce are aggregated to the previous results(if the word
> "that"
> > has frequency 2 in previous processing and in this processing it occurs 1
> > time, then the frequency of "that" is now maintained as 3).
> > In next map-reduce "that" comes 4 times, now its frequency maintained as
> > 7
> >
> You could merge the result from the previous step in the reducer. If the
> number of unique words is not large, the output from the previous step can
> be loaded into an in-memory hash. This can be used to add the count from the
> previous step to the current step.
> In case you expect the unique-word list to be too large to fit in memory,
> you could read the previous step's output directly from HDFS, and since it
> would be a sorted file you could just walk it and merge the counts in a
> single pass in the reduce function.
>
> - Sharad
>
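Not from the thread, but a rough sketch of the in-memory merge Sharad describes,
using the old org.apache.hadoop.mapred API. It assumes the previous run left a
plain text file of word<TAB>count lines on HDFS and that its path is handed to
the job through a made-up property "prev.counts.path".

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MergingReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // word -> count from the previous batch, loaded once per task
  private final Map<String, Integer> prevCounts = new HashMap<String, Integer>();

  public void configure(JobConf job) {
    String prevPath = job.get("prev.counts.path");   // unset for the very first batch
    if (prevPath == null) {
      return;
    }
    try {
      Path prev = new Path(prevPath);
      FileSystem fs = prev.getFileSystem(job);
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(prev)));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");           // word <TAB> count (assumed layout)
        prevCounts.put(parts[0], Integer.valueOf(parts[1]));
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load previous counts", e);
    }
  }

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();                    // count for the current batch
    }
    Integer old = prevCounts.get(key.toString());    // add the count from the previous step
    if (old != null) {
      sum += old.intValue();
    }
    output.collect(key, new IntWritable(sum));
  }
}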



-- 
Aayush Garg,
Phone: +41 764822440


Modeling WordCount in a different way

2009-04-06 Thread Aayush Garg
Hi,

I want to experiment with the wordcount example in a different way.

Suppose we have very large data. Instead of splitting all the data at one time,
we want to feed a few splits to the map-reduce job at a time. I want to model
the Hadoop job like this:

Suppose a batch of inputsplits arrives at the beginning for every map, and
reduce gives the word frequency for this batch of inputsplits.
After this, another batch of inputsplits arrives, and the results from the
subsequent reduce are aggregated into the previous results (if the word "that"
had frequency 2 in the previous processing and occurs 1 time in this
processing, then the frequency of "that" is now maintained as 3).
If "that" comes 4 times in the next map-reduce, its frequency is then
maintained as 7.

And this process goes on like this.
Now, how would I model inputsplits like this, and how can these continuous
map-reduces be kept running? In what way should I keep the results of each
Map-Reduce so that I can aggregate them with the output of the next
Map-Reduce?

Thanks,
Aayush
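For what it is worth, a rough driver-side sketch (only one possible shape, not
the definitive way) of running one word-count job per batch and handing each
job the previous batch's output path so the reducer can fold the old counts in.
WordCountMapper stands for the usual word-count mapper, MergingReducer for a
reducer like the one sketched earlier in this thread, and all paths and
property names are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IncrementalWordCount {
  public static void main(String[] args) throws Exception {
    String prevCounts = null;                          // no history before the first batch
    for (int batch = 0; batch < args.length; batch++) {
      JobConf conf = new JobConf(IncrementalWordCount.class);
      conf.setJobName("wordcount-batch-" + batch);
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(WordCountMapper.class);      // assumed: standard word-count mapper
      conf.setReducerClass(MergingReducer.class);      // reducer that folds in prevCounts
      conf.setNumReduceTasks(1);                       // one sorted output file per batch
      FileInputFormat.setInputPaths(conf, new Path(args[batch]));   // this batch's splits
      Path out = new Path("/wc/counts-" + batch);
      FileOutputFormat.setOutputPath(conf, out);
      if (prevCounts != null) {
        conf.set("prev.counts.path", prevCounts);      // read by the reducer's configure()
      }
      JobClient.runJob(conf);                          // blocks until this batch is done
      prevCounts = out.toString() + "/part-00000";     // single reduce task, single file
    }
  }
}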


Optimized way

2008-12-04 Thread Aayush Garg
Hi,

I have a 5-node cluster for Hadoop; all nodes are multi-core.
I am running a shell command in the Map function of my program, and this shell
command takes one file as input. Many such files have been copied into HDFS.

So, in summary, the map function will run a command like ./run <file>

Could you please suggest the optimized way to do this, e.g. whether I can use
the multi-core processing of the nodes and run many such maps in parallel.

Thanks,
Aayush
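Not a definitive answer, but a rough sketch of one way to wire this up: give the
job a text file listing one HDFS path per line, and let each map task copy its
file to local disk and shell out to ./run. Several such maps then run in
parallel on each multi-core node; the per-node concurrency is controlled by
mapred.tasktracker.map.tasks.maximum. The /tmp scratch location and the
listing-file layout are assumptions.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ShellOutMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    Path hdfsFile = new Path(value.toString());                // one HDFS path per input line
    Path localFile = new Path("/tmp/" + hdfsFile.getName());   // local scratch copy for ./run
    FileSystem.get(conf).copyToLocalFile(hdfsFile, localFile);

    Process p = Runtime.getRuntime().exec(new String[] { "./run", localFile.toString() });
    try {
      int rc = p.waitFor();                                    // wait for the command to finish
      output.collect(value, new Text("exit=" + rc));           // record the exit code per file
    } catch (InterruptedException e) {
      throw new IOException("interrupted while waiting for ./run");
    }
  }
}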


Re: Error in start up

2008-04-23 Thread Aayush Garg
I changed my hostname to R61neptun as you suggested, but I am still getting the
error:

localhost: starting datanode, logging to
/home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-datanode-R61neptun.out
localhost: starting secondarynamenode, logging to
/home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-secondarynamenode-R61neptun.out
localhost: Exception in thread "main" java.lang.IllegalArgumentException:
port out of range:-1
localhost:  at
java.net.InetSocketAddress.<init>(InetSocketAddress.java:118)
localhost:  at
org.apache.hadoop.dfs.DataNode.createSocketAddr(DataNode.java:104)
localhost:  at
org.apache.hadoop.dfs.SecondaryNameNode.<init>(SecondaryNameNode.java:94)
localhost:  at
org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:481)
starting jobtracker, logging to
/home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-jobtracker-R61neptun.out
localhost: starting tasktracker, logging to
/home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-tasktracker-R61neptun.out

Could anyone tell me more about this error? I am just trying to run Hadoop in
pseudo-distributed mode.

Thanks,


On Tue, Apr 22, 2008 at 11:57 PM, Sujee Maniyam <[EMAIL PROTECTED]> wrote:

>
> >> logs/hadoop-root-datanode-R61-neptun.out
>
> May be this will help you:
>
> I am guessing - from the log file name above - that your hostname has
> underscores/dashes (e.g. R61-neptune). Could you try to use the hostname
> without underscores or dashes (e.g. R61neptune, or even simply 'hadoop')?
>
> I had the same problem with Hadoop v0.16.3. My hostnames were
> 'hadoop_master / hadoop_slave', and I was getting the 'port out of range:-1'
> exception. Once I eliminated the underscores (e.g. master / slave) it
> started working.
>
> thanks
>
> --
> View this message in context:
> http://www.nabble.com/Error-in-start-up-tp16783362p16826259.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: Error in start up

2008-04-21 Thread Aayush Garg
Could anyone please help me with the error below? I am not able to start
HDFS because of it.

Thanks,

On Sat, Apr 19, 2008 at 7:25 PM, Aayush Garg <[EMAIL PROTECTED]> wrote:

> I have my hadoop-site.xml correct !! but it creates error in this way
>
>
> On Sat, Apr 19, 2008 at 6:35 PM, Stuart Sierra <[EMAIL PROTECTED]>
> wrote:
>
> > On Sat, Apr 19, 2008 at 9:53 AM, Aayush Garg <[EMAIL PROTECTED]>
> > wrote:
> > >  I am getting following error on start up the hadoop as pseudo
> > distributed::
> > >
> > >  bin/start-all.sh
> > >
> > >  localhost: starting datanode, logging to
> > >
> >  
> > /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-root-datanode-R61-neptun.out
> > >  localhost: starting secondarynamenode, logging to
> > >
> >  
> > /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-root-secondarynamenode-R61-neptun.out
> > >  localhost: Exception in thread "main"
> > java.lang.IllegalArgumentException:
> > >  port out of range:-1
> >
> > Hello, I'm a Hadoop newbie, but when I got this error it seemed to be
> > caused by an empty/incorrect hadoop-site.xml.  See
> >
> > http://wiki.apache.org/hadoop/QuickStart#head-530fd2e5b7fc3f35a210f3090f125416a79c2e1b
> >
> > -Stuart
> >
>
>
>
>


Re: Splitting in various files

2008-04-21 Thread Aayush Garg
I just tried the same thing (mapred.task.id) as you suggested, but I am
getting a single file named "null" in my directory.

On Mon, Apr 21, 2008 at 8:33 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:

> Aayush Garg wrote:
>
> > Could anyone please tell?
> >
> > On Sat, Apr 19, 2008 at 1:33 PM, Aayush Garg <[EMAIL PROTECTED]>
> > wrote:
> >
> >
> >
> > > Hi,
> > >
> > > I have written the following code for writing my key,value pairs in
> > > the
> > > file, and this file is then read by another MR.
> > >
> > >   Path pth = new Path("./dir1/dir2/filename");
> > >   FileSystem fs = pth.getFileSystem(jobconf);
> > >   SequenceFile.Writer sqwrite = new
> > > SequenceFile.Writer(fs,conf,pth,Text.class,Custom.class);
> > >   sqwrite.append(Key,value);
> > >   sqwrite.close();
> > >
> > > I problem is I get my data written in one file(filename).. How can it
> > > be
> > > split across in the number of files. If I give only the path of
> > > directory in
> > >
> > >
> > What do you mean by splitting a file across multiple files? If you want
> a separate file for each map/reduce task then you can use
> conf.get("mapred.task.id") to get the task id that is unique for that task.
> Now you can name the file like
>
> Path pth = new Path("./dir1/dir2/" + filename + "-" +
>                     conf.get("mapred.task.id"));
>
> Amar
>
>  this progam then it does not get compiled.
> > >
> > > I give only the path of directory /dir1/dir2 to another Map Reduce and
> > > it
> > > reads the file.
> > >
> > > Thanks,
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
>
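A guess rather than something confirmed in this thread: mapred.task.id is only
set inside a running map or reduce task, so calling conf.get("mapred.task.id")
from run()/main() returns null and the file ends up literally named "null". A
rough sketch that builds the per-task path inside the reducer instead (Text
values for brevity; substitute your Custom class):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class PerTaskFileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private SequenceFile.Writer writer;

  public void configure(JobConf job) {
    try {
      // only non-null inside a task, e.g. task_200804210830_0001_r_000003_0
      String taskId = job.get("mapred.task.id");
      Path out = new Path("/dir1/dir2/filename-" + taskId);
      FileSystem fs = out.getFileSystem(job);
      writer = SequenceFile.createWriter(fs, job, out, Text.class, Text.class);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output,
                     Reporter reporter) throws IOException {
    while (values.hasNext()) {
      writer.append(key, values.next());   // one record per value, into this task's own file
    }
  }

  public void close() throws IOException {
    writer.close();                        // flush this task's file when the task finishes
  }
}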


Re: Splitting in various files

2008-04-20 Thread Aayush Garg
Could anyone please tell?

On Sat, Apr 19, 2008 at 1:33 PM, Aayush Garg <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have written the following code for writing my key,value pairs in the
> file, and this file is then read by another MR.
>
>Path pth = new Path("./dir1/dir2/filename");
>FileSystem fs = pth.getFileSystem(jobconf);
>SequenceFile.Writer sqwrite = new
> SequenceFile.Writer(fs,conf,pth,Text.class,Custom.class);
>sqwrite.append(Key,value);
>sqwrite.close();
>
> I problem is I get my data written in one file(filename).. How can it be
> split across in the number of files. If I give only the path of directory in
> this progam then it does not get compiled.
>
> I give only the path of directory /dir1/dir2 to another Map Reduce and it
> reads the file.
>
> Thanks,
>
>


-- 
Aayush Garg,
Phone: +41 76 482 240


Re: Error in start up

2008-04-19 Thread Aayush Garg
My hadoop-site.xml is correct, but it still produces the error in this way.

On Sat, Apr 19, 2008 at 6:35 PM, Stuart Sierra <[EMAIL PROTECTED]>
wrote:

> On Sat, Apr 19, 2008 at 9:53 AM, Aayush Garg <[EMAIL PROTECTED]>
> wrote:
> >  I am getting following error on start up the hadoop as pseudo
> distributed::
> >
> >  bin/start-all.sh
> >
> >  localhost: starting datanode, logging to
> >
>  
> /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-root-datanode-R61-neptun.out
> >  localhost: starting secondarynamenode, logging to
> >
>  
> /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-root-secondarynamenode-R61-neptun.out
> >  localhost: Exception in thread "main"
> java.lang.IllegalArgumentException:
> >  port out of range:-1
>
> Hello, I'm a Hadoop newbie, but when I got this error it seemed to be
> caused by an empty/incorrect hadoop-site.xml.  See
>
> http://wiki.apache.org/hadoop/QuickStart#head-530fd2e5b7fc3f35a210f3090f125416a79c2e1b
>
> -Stuart
>
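For reference, a minimal hadoop-site.xml sketch for pseudo-distributed mode. The
host and port values are only examples, but fs.default.name and
mapred.job.tracker do need an explicit host:port, and a missing or malformed
port is one way to end up with "port out of range:-1".

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>   <!-- namenode host:port -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>   <!-- jobtracker host:port -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>                <!-- single-node pseudo-distributed setup -->
  </property>
</configuration>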



-- 
Aayush Garg,
Phone: +41 76 482 240


Error in start up

2008-04-19 Thread Aayush Garg
Hi,

I am getting the following error when starting Hadoop in pseudo-distributed mode:

bin/start-all.sh

localhost: starting datanode, logging to
/home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-root-datanode-R61-neptun.out
localhost: starting secondarynamenode, logging to
/home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-root-secondarynamenode-R61-neptun.out
localhost: Exception in thread "main" java.lang.IllegalArgumentException:
port out of range:-1
localhost:  at
java.net.InetSocketAddress.<init>(InetSocketAddress.java:118)
localhost:  at
org.apache.hadoop.dfs.DataNode.createSocketAddr(DataNode.java:104)
localhost:  at
org.apache.hadoop.dfs.SecondaryNameNode.<init>(SecondaryNameNode.java:94)
localhost:  at
org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:481)
starting jobtracker, logging to
/home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-root-jobtracker-R61-neptun.out

Any idea of this error?

Thanks,


Splitting in various files

2008-04-19 Thread Aayush Garg
Hi,

I have written the following code to write my key/value pairs to a file, which
is then read by another MR job.

    Path pth = new Path("./dir1/dir2/filename");
    FileSystem fs = pth.getFileSystem(conf);
    SequenceFile.Writer sqwrite =
        new SequenceFile.Writer(fs, conf, pth, Text.class, Custom.class);
    sqwrite.append(key, value);
    sqwrite.close();

The problem is that all my data gets written to one file (filename). How can
it be split across a number of files? If I give only the path of a directory
in this program, it does not compile.

I give only the path of the directory /dir1/dir2 to another Map Reduce job,
and it reads the file.

Thanks,


Re: Map reduce classes

2008-04-17 Thread Aayush Garg
The current structure of my program is:

Upper class{
class Reduce{
  reduce function(K1,V1,K2,V2){
// I count the frequency for each key
 // Add output in  HashMap(Key,value)  instead  of  output.collect()
   }
 }

void run()
 {
  runjob();
 // Now eliminate top frequency keys in HashMap built in reduce function
here because only now hashmap is complete.
 // Write this hashmap to a file in such a format so that I can use this
hashmap in next MapReduce job and key of this hashmap is taken as key in
mapper function of that Map Reduce. ?? How and which format should I
choose??? Is this design and approach ok?

  }

  public static void main() {}
}

I am trying to write the HashMap built in the run() function to a file so that
another MapReduce job can use it. For this I am doing:
FileSystem fs = new LocalFileSystem();
SequenceFile.Writer sqwrite = new SequenceFile.Writer(fs,conf,new
Path("./wordcount/works/"),Text.class, MyCustom.class);
Text dum = new Text("Harry");
sqwrite.append(dum, MyCustom_obj);
sqwrite.close();

I am getting the error as:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:272)
at
org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:815)
at
org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:808)
at org.Myorg.WordCount.run(WordCount.java:247)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)

Why am I getting FileSystem create error?

Thanks,
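One possible cause, stated as an assumption rather than a confirmed diagnosis: a
LocalFileSystem built with new is never handed a Configuration, so
FileSystem.create() can trip over an uninitialized field. A rough sketch that
asks Hadoop for the file system instead and writes to a file rather than a
directory (myCustomObj stands for your MyCustom instance):

JobConf conf = new JobConf(WordCount.class);
FileSystem fs = FileSystem.get(conf);               // or FileSystem.getLocal(conf) for local disk
Path out = new Path("./wordcount/works/part-0");    // a file path, not a directory
SequenceFile.Writer writer =
    SequenceFile.createWriter(fs, conf, out, Text.class, MyCustom.class);
writer.append(new Text("Harry"), myCustomObj);      // myCustomObj: an instance of MyCustom
writer.close();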

On Thu, Apr 17, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> Don't assume that any variables are shared between reducers or between
> maps,
> or between maps and reducers.
>
> If you want to share data, put it into HDFS.
>
>
> On 4/17/08 4:01 AM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
>
> > One more thing:::
> > The HashMap that I am generating in the reduce phase will be on single
> node
> > or multiple nodes in the distributed enviornment? If my dataset is large
> > will this approach work? If not what can I do for this?
> > Also same thing with the file that I am writing in the run function
> (simple
> > file opening FileStream) ??
> >
> >
> >
> > On Thu, Apr 17, 2008 at 6:04 AM, Amar Kamat <[EMAIL PROTECTED]>
> wrote:
> >
> >> Ted Dunning wrote:
> >>
> >>> The easiest solution is to not worry too much about running an extra
> MR
> >>> step.
> >>>
> >>> So,
> >>>
> >>> - run a first pass to get the counts.  Use word count as the pattern.
> >>>  Store
> >>> the results in a file.
> >>>
> >>> - run the second pass.  You can now read the hash-table from the file
> >>> you
> >>> stored in pass 1.
> >>>
> >>> Another approach is to do the counting in your maps as specified and
> >>> then
> >>> before exiting, you can emit special records for each key to suppress.
> >>>  With
> >>> the correct sort and partition functions, you can make these killer
> >>> records
> >>> appear first in the reduce input.  Then, if your reducer sees the kill
> >>> flag
> >>> in the front of the values, it can avoid processing any extra data.
> >>>
> >>>
> >>>
> >> Ted,
> >> Will this work for the case where the cutoff frequency/count requires a
> >> global picture? I guess not.
> >>
> >>  In general, it is better to not try to communicate between map and
> reduce
> >>> except via the expected mechanisms.
> >>>
> >>>
> >>> On 4/16/08 1:33 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> >>>
> >>>
> >>>
> >>>> We can not read HashMap in the configure method of the reducer
> because
> >>>> it is
> >>>> called before reduce job.
> >>>> I need to eliminate rows from the HashMap when all the keys are read.
> >>>> Also my concern is if dataset is large will this HashMap thing work??
> >>>>
> >>>>
> >>>> On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning <[EMAIL PROTECTED]>
> >>>> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> That design is fine.
> >>>>>
> >>>>> You should read your map in the configure method of the reducer.
> >>>>>
> >>>>> There is a MapFile format supported by Hadoop, but they tend to be
> >>>>>

Re: Map reduce classes

2008-04-17 Thread Aayush Garg
My latest problem is: I cannot always rely on writing the HashMap to a file
like this:

FileOutputStream fout = new FileOutputStream(f);
ObjectOutputStream objStream = new ObjectOutputStream(fout);
objStream.writeObject(HashMap);

I am doing this writing in the same run() of the outer class. The file can be
very big, so can I write it in such a manner that the file is distributed and
I can read it easily in the next MapReduce phase? Alternatively, can I split
the file when it becomes greater than a certain size?

Thanks,
Aayush
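One rough alternative, offered as a sketch rather than the answer: instead of
serializing the whole HashMap with ObjectOutputStream, append one (word, count)
record at a time to a SequenceFile on HDFS. HDFS then takes care of distributing
the blocks, and the next job can read the file directly with
SequenceFileInputFormat. The path and variable names are placeholders.

JobConf conf = new JobConf(WordCount.class);
FileSystem fs = FileSystem.get(conf);
Path out = new Path("/wordcount/top-keys.seq");
SequenceFile.Writer writer =
    SequenceFile.createWriter(fs, conf, out, Text.class, IntWritable.class);
for (Map.Entry<String, Integer> e : hashMap.entrySet()) {    // hashMap built in run()
  writer.append(new Text(e.getKey()), new IntWritable(e.getValue()));
}
writer.close();

// The next job can then consume it with:
//   nextConf.setInputFormat(SequenceFileInputFormat.class);
//   FileInputFormat.setInputPaths(nextConf, out);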


On Thu, Apr 17, 2008 at 1:01 PM, Aayush Garg <[EMAIL PROTECTED]> wrote:

> One more thing:::
> The HashMap that I am generating in the reduce phase will be on single
> node or multiple nodes in the distributed enviornment? If my dataset is
> large will this approach work? If not what can I do for this?
> Also same thing with the file that I am writing in the run function
> (simple file opening FileStream) ??
>
>
>
>
> On Thu, Apr 17, 2008 at 6:04 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
>
> > Ted Dunning wrote:
> >
> > > The easiest solution is to not worry too much about running an extra
> > > MR
> > > step.
> > >
> > > So,
> > >
> > > - run a first pass to get the counts.  Use word count as the pattern.
> > >  Store
> > > the results in a file.
> > >
> > > - run the second pass.  You can now read the hash-table from the file
> > > you
> > > stored in pass 1.
> > >
> > > Another approach is to do the counting in your maps as specified and
> > > then
> > > before exiting, you can emit special records for each key to suppress.
> > >  With
> > > the correct sort and partition functions, you can make these killer
> > > records
> > > appear first in the reduce input.  Then, if your reducer sees the kill
> > > flag
> > > in the front of the values, it can avoid processing any extra data.
> > >
> > >
> > >
> > Ted,
> > Will this work for the case where the cutoff frequency/count requires a
> > global picture? I guess not.
> >
> >  In general, it is better to not try to communicate between map and
> > > reduce
> > > except via the expected mechanisms.
> > >
> > >
> > > On 4/16/08 1:33 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> > >
> > >
> > >
> > > > We can not read HashMap in the configure method of the reducer
> > > > because it is
> > > > called before reduce job.
> > > > I need to eliminate rows from the HashMap when all the keys are
> > > > read.
> > > > Also my concern is if dataset is large will this HashMap thing
> > > > work??
> > > >
> > > >
> > > > On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning <[EMAIL PROTECTED]>
> > > > wrote:
> > > >
> > > >
> > > >
> > > > > That design is fine.
> > > > >
> > > > > You should read your map in the configure method of the reducer.
> > > > >
> > > > > There is a MapFile format supported by Hadoop, but they tend to be
> > > > > pretty
> > > > > slow.  I usually find it better to just load my hash table by
> > > > > hand.  If
> > > > > you
> > > > > do this, you should use whatever format you like.
> > > > >
> > > > >
> > > > > On 4/16/08 12:41 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > HI,
> > > > > >
> > > > > > The current structure of my program is::
> > > > > > Upper class{
> > > > > > class Reduce{
> > > > > >  reduce function(K1,V1,K2,V2){
> > > > > >// I count the frequency for each key
> > > > > > // Add output in  HashMap(Key,value)  instead  of
> > > > > >  output.collect()
> > > > > >   }
> > > > > >  }
> > > > > >
> > > > > > void run()
> > > > > >  {
> > > > > >  runjob();
> > > > > > // Now eliminate top frequency keys in HashMap built in
> > > > > > reduce
> > > > > >
> > > > > >
> > > > > function
> > > > >
> > > > >
> > &

Re: Map reduce classes

2008-04-17 Thread Aayush Garg
One more thing: will the HashMap that I am generating in the reduce phase be on
a single node or on multiple nodes in the distributed environment? If my
dataset is large, will this approach work? If not, what can I do about it?
The same question applies to the file that I am writing in the run function
(opened with a simple FileStream).



On Thu, Apr 17, 2008 at 6:04 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:

> Ted Dunning wrote:
>
> > The easiest solution is to not worry too much about running an extra MR
> > step.
> >
> > So,
> >
> > - run a first pass to get the counts.  Use word count as the pattern.
> >  Store
> > the results in a file.
> >
> > - run the second pass.  You can now read the hash-table from the file
> > you
> > stored in pass 1.
> >
> > Another approach is to do the counting in your maps as specified and
> > then
> > before exiting, you can emit special records for each key to suppress.
> >  With
> > the correct sort and partition functions, you can make these killer
> > records
> > appear first in the reduce input.  Then, if your reducer sees the kill
> > flag
> > in the front of the values, it can avoid processing any extra data.
> >
> >
> >
> Ted,
> Will this work for the case where the cutoff frequency/count requires a
> global picture? I guess not.
>
>  In general, it is better to not try to communicate between map and reduce
> > except via the expected mechanisms.
> >
> >
> > On 4/16/08 1:33 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> >
> >
> >
> > > We can not read HashMap in the configure method of the reducer because
> > > it is
> > > called before reduce job.
> > > I need to eliminate rows from the HashMap when all the keys are read.
> > > Also my concern is if dataset is large will this HashMap thing work??
> > >
> > >
> > > On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > >
> > >
> > > > That design is fine.
> > > >
> > > > You should read your map in the configure method of the reducer.
> > > >
> > > > There is a MapFile format supported by Hadoop, but they tend to be
> > > > pretty
> > > > slow.  I usually find it better to just load my hash table by hand.
> > > >  If
> > > > you
> > > > do this, you should use whatever format you like.
> > > >
> > > >
> > > > On 4/16/08 12:41 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> > > >
> > > >
> > > >
> > > > > HI,
> > > > >
> > > > > The current structure of my program is::
> > > > > Upper class{
> > > > > class Reduce{
> > > > >  reduce function(K1,V1,K2,V2){
> > > > >// I count the frequency for each key
> > > > > // Add output in  HashMap(Key,value)  instead  of
> > > > >  output.collect()
> > > > >   }
> > > > >  }
> > > > >
> > > > > void run()
> > > > >  {
> > > > >  runjob();
> > > > > // Now eliminate top frequency keys in HashMap built in reduce
> > > > >
> > > > >
> > > > function
> > > >
> > > >
> > > > > here because only now hashmap is complete.
> > > > > // Write this hashmap to a file in such a format so that I can
> > > > > use
> > > > >
> > > > >
> > > > this
> > > >
> > > >
> > > > > hashmap in next MapReduce job and key of this hashmap is taken as
> > > > > key in
> > > > > mapper function of that Map Reduce. ?? How and which format should
> > > > > I
> > > > > choose??? Is this design and approach ok?
> > > > >
> > > > >  }
> > > > >
> > > > >  public static void main() {}
> > > > > }
> > > > > I hope you have got my question.
> > > > >
> > > > > Thanks,
> > > > >
> > > > >
> > > > > On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat <[EMAIL PROTECTED]>
> > > > >
> > > > >
> > > > wrote:
> > > >
> > > >
> > > > > Aayush Garg wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Are you sure that another MR is required for eliminating some
> > > > > > > rows?
> > > > > > > Can't I
> > > > > > > just somehow eliminate from main() when I know the keys which
> > > > > > > are
> > > > > > >
> > > > > > >
> > > > > > needed
> > > >
> > > >
> > > > > to
> > > > > > > remove?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > Can you provide some more details on how exactly are you
> > > > > > filtering?
> > > > > > Amar
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
>
>


-- 
Aayush Garg,
Phone: +41 76 482 240


Re: Map reduce classes

2008-04-16 Thread Aayush Garg
Yes, Alexandre, you are right that I can't do this:

if (frequency < threshold)
  output.collect(...);

in the reducer...

Ted,
I got some idea of your second approach, but as I am new to this, could you
explain it with the help of some code?
With that approach I would need only one map-reduce job, and that is what I
want to try first.

Thanks,



On Wed, Apr 16, 2008 at 10:45 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> The easiest solution is to not worry too much about running an extra MR
> step.
>
> So,
>
> - run a first pass to get the counts.  Use word count as the pattern.
>  Store
> the results in a file.
>
> - run the second pass.  You can now read the hash-table from the file you
> stored in pass 1.
>
> Another approach is to do the counting in your maps as specified and then
> before exiting, you can emit special records for each key to suppress.
>  With
> the correct sort and partition functions, you can make these killer
> records
> appear first in the reduce input.  Then, if your reducer sees the kill
> flag
> in the front of the values, it can avoid processing any extra data.
>
> In general, it is better to not try to communicate between map and reduce
> except via the expected mechanisms.
>
>
>
> On 4/16/08 1:33 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
>
> > We can not read HashMap in the configure method of the reducer because
> it is
> > called before reduce job.
> > I need to eliminate rows from the HashMap when all the keys are read.
> > Also my concern is if dataset is large will this HashMap thing work??
> >
> >
> > On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> That design is fine.
> >>
> >> You should read your map in the configure method of the reducer.
> >>
> >> There is a MapFile format supported by Hadoop, but they tend to be
> pretty
> >> slow.  I usually find it better to just load my hash table by hand.  If
> >> you
> >> do this, you should use whatever format you like.
> >>
> >>
> >> On 4/16/08 12:41 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> >>
> >>> HI,
> >>>
> >>> The current structure of my program is::
> >>> Upper class{
> >>> class Reduce{
> >>>   reduce function(K1,V1,K2,V2){
> >>> // I count the frequency for each key
> >>>  // Add output in  HashMap(Key,value)  instead  of
>  output.collect()
> >>>}
> >>>  }
> >>>
> >>> void run()
> >>>  {
> >>>   runjob();
> >>>  // Now eliminate top frequency keys in HashMap built in reduce
> >> function
> >>> here because only now hashmap is complete.
> >>>      // Write this hashmap to a file in such a format so that I can
> use
> >> this
> >>> hashmap in next MapReduce job and key of this hashmap is taken as key
> in
> >>> mapper function of that Map Reduce. ?? How and which format should I
> >>> choose??? Is this design and approach ok?
> >>>
> >>>   }
> >>>
> >>>   public static void main() {}
> >>> }
> >>> I hope you have got my question.
> >>>
> >>> Thanks,
> >>>
> >>>
> >>> On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat <[EMAIL PROTECTED]>
> >> wrote:
> >>>
> >>>> Aayush Garg wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Are you sure that another MR is required for eliminating some rows?
> >>>>> Can't I
> >>>>> just somehow eliminate from main() when I know the keys which are
> >> needed
> >>>>> to
> >>>>> remove?
> >>>>>
> >>>>>
> >>>>>
> >>>> Can you provide some more details on how exactly are you filtering?
> >>>> Amar
> >>>>
> >>>>
> >>>>
> >>
> >>
> >
>
>
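The kill-record variant needs custom sort and partition classes, so it is hard
to sketch safely here; the first (two-pass) approach is easier to pin down.
Roughly, and only as a sketch: pass 1 is plain word count, and pass 2 loads
those counts in the reducer's configure() and drops keys whose frequency is
above a cutoff. The property names and the word<TAB>count layout are
assumptions.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CutoffReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private final Set<String> keysToDrop = new HashSet<String>();

  public void configure(JobConf job) {
    int cutoff = job.getInt("wordcount.cutoff", 1000);             // frequency threshold
    try {
      Path counts = new Path(job.get("wordcount.counts.path"));    // pass-1 output
      FileSystem fs = counts.getFileSystem(job);
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(counts)));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");                         // word <TAB> count
        if (Integer.parseInt(parts[1]) > cutoff) {
          keysToDrop.add(parts[0]);                                // too frequent: suppress in pass 2
        }
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output,
                     Reporter reporter) throws IOException {
    if (keysToDrop.contains(key.toString())) {
      return;                                                      // eliminated row
    }
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}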


-- 
Aayush Garg,
Phone: +41 76 482 240


Re: Map reduce classes

2008-04-16 Thread Aayush Garg
We cannot read the HashMap in the configure method of the reducer because it is
called before the reduce job.
I need to eliminate rows from the HashMap once all the keys have been read.
My other concern is whether this HashMap approach will work if the dataset is large.


On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> That design is fine.
>
> You should read your map in the configure method of the reducer.
>
> There is a MapFile format supported by Hadoop, but they tend to be pretty
> slow.  I usually find it better to just load my hash table by hand.  If
> you
> do this, you should use whatever format you like.
>
>
> On 4/16/08 12:41 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
>
> > HI,
> >
> > The current structure of my program is::
> > Upper class{
> > class Reduce{
> >   reduce function(K1,V1,K2,V2){
> > // I count the frequency for each key
> >  // Add output in  HashMap(Key,value)  instead  of  output.collect()
> >}
> >  }
> >
> > void run()
> >  {
> >   runjob();
> >  // Now eliminate top frequency keys in HashMap built in reduce
> function
> > here because only now hashmap is complete.
> >  // Write this hashmap to a file in such a format so that I can use
> this
> > hashmap in next MapReduce job and key of this hashmap is taken as key in
> > mapper function of that Map Reduce. ?? How and which format should I
> > choose??? Is this design and approach ok?
> >
> >   }
> >
> >   public static void main() {}
> > }
> > I hope you have got my question.
> >
> > Thanks,
> >
> >
> > On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat <[EMAIL PROTECTED]>
> wrote:
> >
> >> Aayush Garg wrote:
> >>
> >>> Hi,
> >>>
> >>> Are you sure that another MR is required for eliminating some rows?
> >>> Can't I
> >>> just somehow eliminate from main() when I know the keys which are
> needed
> >>> to
> >>> remove?
> >>>
> >>>
> >>>
> >> Can you provide some more details on how exactly are you filtering?
> >> Amar
> >>
> >>
> >>
>
>


-- 
Aayush Garg,
Phone: +41 76 482 240


Re: Map reduce classes

2008-04-16 Thread Aayush Garg
Hi,

The current structure of my program is:
Upper class{
class Reduce{
  reduce function(K1,V1,K2,V2){
// I count the frequency for each key
 // Add output in  HashMap(Key,value)  instead  of  output.collect()
   }
 }

void run()
 {
  runjob();
 // Now eliminate top frequency keys in HashMap built in reduce function
here because only now hashmap is complete.
 // Write this hashmap to a file in such a format so that I can use this
hashmap in next MapReduce job and key of this hashmap is taken as key in
mapper function of that Map Reduce. ?? How and which format should I
choose??? Is this design and approach ok?

  }

  public static void main() {}
}
I hope you have got my question.

Thanks,


On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:

> Aayush Garg wrote:
>
> > Hi,
> >
> > Are you sure that another MR is required for eliminating some rows?
> > Can't I
> > just somehow eliminate from main() when I know the keys which are needed
> > to
> > remove?
> >
> >
> >
> Can you provide some more details on how exactly are you filtering?
> Amar
>
>
>


Re: Map reduce classes

2008-04-15 Thread Aayush Garg
Hi,

Are you sure that another MR job is required for eliminating some rows? Can't I
just somehow eliminate them from main() once I know which keys need to be
removed?

For the second point, can I first write in SequenceFile format and then read it
using SequenceFileRecordReader? I can't figure out how exactly to write that
code snippet.

Thanks,


On Wed, Apr 16, 2008 at 7:18 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:

> Aayush Garg wrote:
>
> > HI,
> > Could you please suggest what classes and another better way to achieve
> > this:-
> >
> > I am getting outputcollector in my reduce function as:
> >
> >  void reduce()
> > {
> >   output.collect(key,value);
> > }
> >
> > Here key is Text,
> > and value is Custom class type that I generated from rcc.
> >
> > 1.  After all calls are complete to reduce function, I need to eliminate
> > certain rows in this outputformat based on keys. I guess I need to store
> > this outputformat in some static Map(declared in Reduce class) and need
> > to
> > do required operations from the Main function. Is this right approach?
> >
> >
> I think you need to run another MR job for doing this record filtering.
>
> > 2.  This stored outputformat I want to use for another Map Reduce job.
> > What
> > classes and format should I use in the previous step so that I can
> > easily
> > use this as input in another program invoking MR job.
> >
> >
> The value class should implement Writable (see
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/Writable.html).
> You need to write your own InputFormat (see
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Job+Input)
> that will have a custom RecordReader (see
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#RecordReader
> ).
> Amar
>
> > Regards,
> > Garg
> >
> >
> >
>
>
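A minimal sketch of a value class that implements Writable, along the lines
Amar describes. The fields (a frequency plus a list of doc ids) are just an
example shape, not your actual Custom class.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class Custom implements Writable {
  private int freq;
  private final List<String> docIds = new ArrayList<String>();

  public void write(DataOutput out) throws IOException {
    out.writeInt(freq);
    out.writeInt(docIds.size());
    for (String id : docIds) {
      Text.writeString(out, id);        // length-prefixed UTF-8 string
    }
  }

  public void readFields(DataInput in) throws IOException {
    freq = in.readInt();
    docIds.clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      docIds.add(Text.readString(in));
    }
  }
}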


-- 
Aayush Garg,
Phone: +41 76 482 240


Map reduce classes

2008-04-15 Thread Aayush Garg
Hi,
Could you please suggest which classes to use, or a better way to achieve
this:

I am getting outputcollector in my reduce function as:

 void reduce()
{
   output.collect(key,value);
}

Here the key is Text, and the value is a custom class type that I generated
with rcc.

1.  After all calls to the reduce function are complete, I need to eliminate
certain rows in this output based on their keys. I guess I need to store this
output in some static Map (declared in the Reduce class) and do the required
operations from the main function. Is this the right approach?
2.  I want to use this stored output for another Map Reduce job. What classes
and format should I use in the previous step so that I can easily use it as
input in another program invoking an MR job?

Regards,
Garg


Re: Sorting the OutputCollector

2008-04-09 Thread Aayush Garg
But the problem is that I need to sort according to freq, which is part of my
value field. Any inputs? Could you provide a small piece of code illustrating
your idea?


On Wed, Apr 9, 2008 at 9:45 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

> On Apr 8, 2008, at 4:54 AM, Aayush Garg wrote:
>
>  I construct this type of key, value pairs in the outputcollector of
> > reduce
> > phase. Now I want to "SORT"  this   outputcollector in decreasing order
> > of
> > the value of frequency in Custom class.
> > Could some one suggest the possible way to do this?
> >
>
> In order to re-sort the output of the reduce, you need to run a second
> job, which has inputs of the first job's output.
>
> -- Owen
>
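A rough sketch of the second job Owen describes, under a couple of assumptions:
the first job wrote a SequenceFile of (word, Custom) pairs, and Custom exposes
its freq field through a getFreq() accessor (made up here). The mapper swaps
the frequency into the key position and a descending comparator sorts the keys.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FreqAsKeyMapper extends MapReduceBase
    implements Mapper<Text, Custom, IntWritable, Text> {
  public void map(Text word, Custom value,
                  OutputCollector<IntWritable, Text> output,
                  Reporter reporter) throws IOException {
    output.collect(new IntWritable(value.getFreq()), word);   // frequency becomes the sort key
  }
}

// In its own file: sort keys in decreasing order by flipping the IntWritable comparison.
public class DescendingIntComparator extends WritableComparator {
  public DescendingIntComparator() {
    super(IntWritable.class, true);
  }
  public int compare(WritableComparable a, WritableComparable b) {
    return -super.compare(a, b);                               // largest frequency first
  }
}

// In the driver of the second job:
//   conf.setInputFormat(SequenceFileInputFormat.class);
//   conf.setMapperClass(FreqAsKeyMapper.class);
//   conf.setOutputKeyComparatorClass(DescendingIntComparator.class);
//   conf.setOutputKeyClass(IntWritable.class);
//   conf.setOutputValueClass(Text.class);
//   conf.setNumReduceTasks(1);   // a single reducer yields one globally sorted file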



-- 
Aayush Garg,
Phone: +41 76 482 240


Sorting the OutputCollector

2008-04-08 Thread Aayush Garg
Hi,

I have implemented key and value pairs in the following way:

Key (Text class)    Value (Custom class)
word1
word2


class Custom {
  int freq;
  TreeMap<...>
}

I construct this type of key/value pair in the output collector of the reduce
phase. Now I want to SORT this output collector in decreasing order of the
frequency value in the Custom class.
Could someone suggest a possible way to do this?

Thanks,
Aayush


Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Aayush Garg
Please give me your input on my problem.

Thanks,


On Sat, Apr 5, 2008 at 1:10 AM, Robert Dempsey <[EMAIL PROTECTED]> wrote:

> Ted,
>
> It appears that Nutch hasn't been updated in a while (in Internet time at
> least). Do you know if it works with the latest versions of Hadoop? Thanks.
>
> - Robert Dempsey (new to the list)
>
>
> On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote:
>
> >
> >
> > See Nutch.  See Nutch run.
> >
> > http://en.wikipedia.org/wiki/Nutch
> > http://lucene.apache.org/nutch/
> >
>


-- 
Aayush Garg,
Phone: +41 76 482 240


Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Aayush Garg
Hi,

I have never used a Lucene index before, and I do not see how to build one with
Hadoop Map/Reduce. Basically, what I was looking for is how to implement
multilevel map/reduce for the problem I mentioned.


On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <[EMAIL PROTECTED]> wrote:

> You can build Lucene indexes using Hadoop Map/Reduce. See the index
> contrib package in the trunk. Or is it still not something you are
> looking for?
>
> Regards,
> Ning
>
> On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote:
> > No, currently my requirement is to solve this problem by apache hadoop.
> I am
> > trying to build up this type of inverted index and then measure
> performance
> > criteria with respect to others.
> >
> > Thanks,
> >
> >
> > On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > Are you implementing this for instruction or production?
> > >
> > > If production, why not use Lucene?
> > >
> > >
> > > On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> > >
> > > > HI  Amar , Theodore, Arun,
> > > >
> > > > Thanks for your reply. Actaully I am new to hadoop so cant figure
> out
> > > much.
> > > > I have written following code for inverted index. This code maps
> each
> > > word
> > > > from the document to its document id.
> > > > ex: apple file1 file123
> > > > Main functions of the code are:-
> > > >
> > > > public class HadoopProgram extends Configured implements Tool {
> > > > public static class MapClass extends MapReduceBase
> > > > implements Mapper {
> > > >
> > > > private final static IntWritable one = new IntWritable(1);
> > > > private Text word = new Text();
> > > > private Text doc = new Text();
> > > > private long numRecords=0;
> > > > private String inputFile;
> > > >
> > > >public void configure(JobConf job){
> > > > System.out.println("Configure function is called");
> > > > inputFile = job.get("map.input.file");
> > > > System.out.println("In conf the input file is"+inputFile);
> > > > }
> > > >
> > > >
> > > > public void map(LongWritable key, Text value,
> > > > OutputCollector output,
> > > > Reporter reporter) throws IOException {
> > > >   String line = value.toString();
> > > >   StringTokenizer itr = new StringTokenizer(line);
> > > >   doc.set(inputFile);
> > > >   while (itr.hasMoreTokens()) {
> > > > word.set(itr.nextToken());
> > > > output.collect(word,doc);
> > > >   }
> > > >   if(++numRecords%4==0){
> > > >System.out.println("Finished processing of input
> > > file"+inputFile);
> > > >  }
> > > > }
> > > >   }
> > > >
> > > >   /**
> > > >* A reducer class that just emits the sum of the input values.
> > > >*/
> > > >   public static class Reduce extends MapReduceBase
> > > > implements Reducer {
> > > >
> > > >   // This works as K2, V2, K3, V3
> > > > public void reduce(Text key, Iterator values,
> > > >OutputCollector output,
> > > >Reporter reporter) throws IOException {
> > > >   int sum = 0;
> > > >   Text dummy = new Text();
> > > >   ArrayList IDs = new ArrayList();
> > > >   String str;
> > > >
> > > >   while (values.hasNext()) {
> > > >  dummy = values.next();
> > > >  str = dummy.toString();
> > > >  IDs.add(str);
> > > >}
> > > >DocIDs dc = new DocIDs();
> > > >dc.setListdocs(IDs);
> > > >   output.collect(key,dc);
> > > > }
> > > >   }
> > > >
> > > >  public int run(String[] args) throws Exception {
> > > >   System.out.println("Run function is called");
> > > > JobConf conf = new JobConf(getConf(), WordCount.class);
> > > > conf.setJobName("wordcount");
> > > >
> > > > // the key

Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Aayush Garg
No, currently my requirement is to solve this problem with Apache Hadoop. I am
trying to build up this type of inverted index and then measure its performance
relative to other approaches.

Thanks,


On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> Are you implementing this for instruction or production?
>
> If production, why not use Lucene?
>
>
> On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
>
> > HI  Amar , Theodore, Arun,
> >
> > Thanks for your reply. Actaully I am new to hadoop so cant figure out
> much.
> > I have written following code for inverted index. This code maps each
> word
> > from the document to its document id.
> > ex: apple file1 file123
> > Main functions of the code are:-
> >
> > public class HadoopProgram extends Configured implements Tool {
> > public static class MapClass extends MapReduceBase
> > implements Mapper {
> >
> > private final static IntWritable one = new IntWritable(1);
> > private Text word = new Text();
> > private Text doc = new Text();
> > private long numRecords=0;
> > private String inputFile;
> >
> >public void configure(JobConf job){
> > System.out.println("Configure function is called");
> > inputFile = job.get("map.input.file");
> > System.out.println("In conf the input file is"+inputFile);
> > }
> >
> >
> > public void map(LongWritable key, Text value,
> > OutputCollector output,
> > Reporter reporter) throws IOException {
> >   String line = value.toString();
> >   StringTokenizer itr = new StringTokenizer(line);
> >   doc.set(inputFile);
> >   while (itr.hasMoreTokens()) {
> > word.set(itr.nextToken());
> > output.collect(word,doc);
> >   }
> >   if(++numRecords%4==0){
> >System.out.println("Finished processing of input
> file"+inputFile);
> >  }
> > }
> >   }
> >
> >   /**
> >* A reducer class that just emits the sum of the input values.
> >*/
> >   public static class Reduce extends MapReduceBase
> > implements Reducer {
> >
> >   // This works as K2, V2, K3, V3
> > public void reduce(Text key, Iterator values,
> >OutputCollector output,
> >Reporter reporter) throws IOException {
> >   int sum = 0;
> >   Text dummy = new Text();
> >   ArrayList IDs = new ArrayList();
> >   String str;
> >
> >   while (values.hasNext()) {
> >  dummy = values.next();
> >  str = dummy.toString();
> >  IDs.add(str);
> >}
> >DocIDs dc = new DocIDs();
> >dc.setListdocs(IDs);
> >   output.collect(key,dc);
> > }
> >   }
> >
> >  public int run(String[] args) throws Exception {
> >   System.out.println("Run function is called");
> > JobConf conf = new JobConf(getConf(), WordCount.class);
> > conf.setJobName("wordcount");
> >
> > // the keys are words (strings)
> > conf.setOutputKeyClass(Text.class);
> >
> > conf.setOutputValueClass(Text.class);
> >
> >
> > conf.setMapperClass(MapClass.class);
> >
> > conf.setReducerClass(Reduce.class);
> > }
> >
> >
> > Now I am getting output array from the reducer as:-
> > word \root\test\test123, \root\test12
> >
> > In the next stage I want to stop 'stop  words',  scrub words etc. and
> like
> > position of the word in the document. How would I apply multiple maps or
> > multilevel map reduce jobs programmatically? I guess I need to make
> another
> > class or add some functions in it? I am not able to figure it out.
> > Any pointers for these type of problems?
> >
> > Thanks,
> > Aayush
> >
> >
> > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]>
> wrote:
> >
> >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> >>
> >>> HI,
> >>> I am developing the simple inverted index program frm the hadoop. My
> map
> >>> function has the output:
> >>> 
> >>> and the reducer has:
> >>> 
> >>>
> >>> Now I want to use one more mapreduce to remove stop and scrub words
> from
> >> Use distributed cache as Arun mentioned.
> >>> this output. Also in the next stage I would like to have short summay
> >> Whether to use a separate MR job depends on what exactly you mean by
> >> summary. If its like a window around the current word then you can
> >> possibly do it in one go.
> >> Amar
> >>> associated with every word. How should I design my program from this
> >> stage?
> >>> I mean how would I apply multiple mapreduce to this? What would be the
> >>> better way to perform this?
> >>>
> >>> Thanks,
> >>>
> >>> Regards,
> >>> -
> >>>
> >>>
> >>
>
>


-- 
Aayush Garg,
Phone: +41 76 482 240


Re: Hadoop: Multiple map reduce or some better way

2008-04-03 Thread Aayush Garg
Hi Amar, Theodore, Arun,

Thanks for your reply. Actually I am new to Hadoop, so I can't figure out much.
I have written the following code for an inverted index. It maps each word
from a document to its document id,
e.g. apple: file1, file123
The main functions of the code are:

public class HadoopProgram extends Configured implements Tool {
public static class MapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
private Text doc = new Text();
private long numRecords=0;
private String inputFile;

   public void configure(JobConf job){
System.out.println("Configure function is called");
inputFile = job.get("map.input.file");
System.out.println("In conf the input file is"+inputFile);
}


public void map(LongWritable key, Text value,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
  String line = value.toString();
  StringTokenizer itr = new StringTokenizer(line);
  doc.set(inputFile);
  while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word,doc);
  }
  if(++numRecords%4==0){
   System.out.println("Finished processing of input file"+inputFile);
 }
}
  }

  /**
   * A reducer class that just emits the sum of the input values.
   */
  public static class Reduce extends MapReduceBase
implements Reducer<Text, Text, Text, DocIDs> {

  // This works as K2, V2, K3, V3
public void reduce(Text key, Iterator<Text> values,
   OutputCollector<Text, DocIDs> output,
   Reporter reporter) throws IOException {
  int sum = 0;
  Text dummy = new Text();
  ArrayList<String> IDs = new ArrayList<String>();
  String str;

  while (values.hasNext()) {
 dummy = values.next();
 str = dummy.toString();
 IDs.add(str);
   }
   DocIDs dc = new DocIDs();
   dc.setListdocs(IDs);
  output.collect(key,dc);
}
  }

 public int run(String[] args) throws Exception {
  System.out.println("Run function is called");
JobConf conf = new JobConf(getConf(), WordCount.class);
conf.setJobName("wordcount");

// the keys are words (strings)
conf.setOutputKeyClass(Text.class);

conf.setOutputValueClass(Text.class);


conf.setMapperClass(MapClass.class);

conf.setReducerClass(Reduce.class);
}


Now I am getting output from the reducer like:
word \root\test\test123, \root\test12

In the next stage I want to remove stop words, scrub words, etc., and also keep
something like the position of each word in the document. How would I apply
multiple maps or multilevel map-reduce jobs programmatically? I guess I need to
make another class or add some functions to it, but I am not able to figure it
out. Any pointers for this type of problem?

Thanks,
Aayush


On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:

> On Wed, 26 Mar 2008, Aayush Garg wrote:
>
> > HI,
> > I am developing the simple inverted index program frm the hadoop. My map
> > function has the output:
> > 
> > and the reducer has:
> > 
> >
> > Now I want to use one more mapreduce to remove stop and scrub words from
> Use distributed cache as Arun mentioned.
> > this output. Also in the next stage I would like to have short summay
> Whether to use a separate MR job depends on what exactly you mean by
> summary. If its like a window around the current word then you can
> possibly do it in one go.
> Amar
> > associated with every word. How should I design my program from this
> stage?
> > I mean how would I apply multiple mapreduce to this? What would be the
> > better way to perform this?
> >
> > Thanks,
> >
> > Regards,
> > -
> >
> >
>
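A rough sketch of the distributed-cache route Arun and Amar point at, so the
stop-word filtering does not need its own MR pass. The file name /stopwords.txt,
the one-word-per-line layout, and the class names are placeholders.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Driver side: DistributedCache.addCacheFile(new URI("/stopwords.txt"), conf);
public class StopWordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Set<String> stopWords = new HashSet<String>();
  private final Text doc = new Text();

  public void configure(JobConf job) {
    try {
      doc.set(job.get("map.input.file"));                    // document id, as in the thread
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String word;
      while ((word = in.readLine()) != null) {
        stopWords.add(word.trim());                          // one stop word per line
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken();
      if (!stopWords.contains(token)) {                      // drop stop words before emitting
        output.collect(new Text(token), doc);
      }
    }
  }
}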


Hadoop: Multiple map reduce or some better way

2008-03-26 Thread Aayush Garg
Hi,
I am developing a simple inverted index program for Hadoop. My map function has
the output:
<word, docid>
and the reducer has:
<word, list of docids>

Now I want to use one more mapreduce job to remove stop and scrub words from
this output. Also, in the next stage I would like to have a short summary
associated with every word. How should I design my program from this stage?
I mean, how would I apply multiple mapreduce jobs to this? What would be the
best way to perform this?

Thanks,

Regards,
-
Aayush Garg,
Phone: +41 76 482 240