Hi Renaud,
Thank you for your reply. This is valuable information, but could you
elaborate a bit more on a few points:

You say: Nutch is "always" using Hadoop.

I assume it does not use the Hadoop Distributed File System (HDFS) when
running on a single machine by default?

The Hadoop homepage says: "Hadoop implements MapReduce, using the HDFS."

If there is no distributed file system over compute nodes (single-machine
configuration), what does Hadoop do?
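My guess is that in that case Hadoop simply runs the MapReduce jobs in a
single local process and keeps all the crawl data on the local disk, so
after a tutorial-style crawl I would expect to see ordinary local
directories (I am assuming the "crawl" output directory name from the
NutchTutorial here):

    # plain local directories on a single machine, no HDFS involved (my assumption)
    ls crawl
    # crawldb/  index/  indexes/  linkdb/  segments/

Is that understanding correct?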

When running the crawl/recrawl cycle (generate/fetch/update), what
processes is Hadoop running? How can I monitor them to see what is
going on (e.g. how many URLs are fetched and how many are still
unfetched from the fetchlist)? Is there a GUI for this?
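For example, is something like the following the intended way to check
progress on a single machine? I am assuming the tutorial-style "crawl"
directory and that the log output ends up in logs/hadoop.log:

    # print CrawlDb statistics: totals of fetched / unfetched / gone URLs, scores, etc.
    bin/nutch readdb crawl/crawldb -stats

    # follow what the generate/fetch/update jobs are doing
    tail -f logs/hadoop.log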

You say: Fetching 100 sites of 1000 nodes with a single machine should
definitively be OK.

What about recrawling on a regular basis (once a day or even more often)?
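To make that concrete, I was thinking of something as simple as a cron
entry around the recrawl script from the wiki (the script path and log
file below are just placeholders):

    # crontab entry: run the recrawl every night at 02:00 (placeholder paths)
    0 2 * * * /opt/nutch/bin/recrawl.sh >> /opt/nutch/logs/recrawl.log 2>&1

Would something like that be reasonable on a single machine?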

Sorry if these are basic questions, but I am trying to learn about Nutch
and Hadoop.

Thanks,
     Tomislav

On Thu, 2007-08-30 at 18:06 -0400, [EMAIL PROTECTED] wrote:
> hi Tomislav,
> 
> The Nutch Tutorial is the way to go. Fetching 100 sites of 1000 nodes 
> with a single machine should definitively be OK. You might want to add 
> more machines if many many people are searching your index.
> 
> BTW, Nutch is "always" using Hadoop. When testing locally or when using 
> only one machine, Hadoop just uses the local file system. So even the 
> NutchTutorial uses Hadoop.
> 
> HTH,
> Renaud
> 
> > Would it be recommended to use hadoop for crawling (100 sites with 1000
> > pages each) on a single machine? What would be the benefit?
> > Something like described on:
> > http://wiki.apache.org/nutch/NutchHadoopTutorial but on a single
> > machine.
> >
> >
> > Or is the simple crawl/recrawl (without hadoop, like described in nutch
> > tutorial on wiki:  
> > http://wiki.apache.org/nutch/NutchTutorial + recrawl script from wiki)
> > way to go?
> >
> > Thanks,
> >        Tomislav
> >
> >
> >   
> 
