Hi Renaud,
Thank you for your reply. This is valuable information, but could you
elaborate a little more on a few points?
You say Nutch is "always" using Hadoop.
I assume it does not use the Hadoop Distributed File System (HDFS) by
default when running on a single machine?
The Hadoop homepage says: "Hadoop implements MapReduce, using the HDFS."
If there is no distributed file system spread over compute nodes (the
single-machine configuration), what does Hadoop actually do?
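For example, I would guess the switch between the two modes is made in
conf/hadoop-site.xml with something like the following (the property
names and values are just my reading of the Hadoop docs, so please
correct me if I am wrong):

  <!-- single-machine / "local" mode: jobs run in one JVM and read and
       write the local file system instead of HDFS -->
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>  <!-- or "local" in older versions? -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>

  <!-- a distributed setup would instead point these at a NameNode and
       a JobTracker, e.g. hdfs://namenode:9000 -->

Is that roughly how it works?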
When running the crawl/recrawl cycle (generate/fetch/update), what
processes is Hadoop actually running? How can I monitor them to see what
is going on (e.g. how many URLs have been fetched and how many from the
fetchlist are still unfetched)? Is there a GUI for this?
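From what I have read so far, I am guessing something like this would
show the fetched/unfetched counts (taken from the wiki, so treat it as
my assumption):

  # print status statistics from the crawldb
  bin/nutch readdb crawl/crawldb -stats

  # I would expect output with counts roughly like:
  #   TOTAL urls: ...
  #   status (db_unfetched): ...
  #   status (db_fetched):   ...

Is that the right way to do it, or is there a web interface? I have seen
the Hadoop JobTracker UI on port 50030 mentioned, but I suppose that
only applies when a real JobTracker is running, not in local mode.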
You say: "Fetching 100 sites of 1000 nodes with a single machine should
definitely be OK."
What about recrawling on a regular basis (once a day, or even more often)?
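To show what I have in mind, this is roughly the cycle I would put into
a cron job, adapted from the wiki recrawl script (the paths and the
-topN value are placeholders I made up):

  # generate a new fetchlist from the crawldb
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000

  # fetch the newest segment
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment

  # update the crawldb with the results of the fetch
  bin/nutch updatedb crawl/crawldb $segment

  # then invertlinks / index / dedup / merge as in the wiki script

Would running that once a day (or more often) on a single machine still
be reasonable for ~100 sites?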
Sorry if these are basic questions, but I am trying to learn about Nutch
and Hadoop.
Thanks,
Tomislav
On Thu, 2007-08-30 at 18:06 -0400, [EMAIL PROTECTED] wrote:
> hi Tomislav,
>
> The Nutch Tutorial is the way to go. Fetching 100 sites of 1000 nodes
> with a single machine should definitely be OK. You might want to add
> more machines if many people are searching your index.
>
> BTW, Nutch is "always" using Hadoop. When testing locally or when using
> only one machine, Hadoop just uses the local file system. So even the
> NutchTutorial uses Hadoop.
>
> HTH,
> Renaud
>
> > Would it be recommended to use Hadoop for crawling (100 sites with 1000
> > pages each) on a single machine? What would be the benefit?
> > Something like what is described at
> > http://wiki.apache.org/nutch/NutchHadoopTutorial but on a single
> > machine.
> >
> >
> > Or is the simple crawl/recrawl (without Hadoop, as described in the Nutch
> > tutorial on the wiki: http://wiki.apache.org/nutch/NutchTutorial, plus
> > the recrawl script from the wiki) the way to go?
> >
> > Thanks,
> > Tomislav
> >