Hi Tomislav,
The Nutch Tutorial is the way to go. Fetching 100 sites of 1000 pages
each with a single machine should definitely be fine. You might want to
add more machines if a large number of people are searching your index.
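For that scale, the one-step crawl command from the NutchTutorial is
all you need. A rough sketch (the "urls" seed directory and the -depth
and -topN values here are only illustrative, tune them for your sites):

    # crawl the seed list, following links 3 hops deep and keeping
    # at most the 1000 top-scoring pages per round
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000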
BTW, Nutch "always" uses Hadoop. When testing locally or running on
only one machine, Hadoop simply runs in local mode and uses the local
file system instead of HDFS. So even the NutchTutorial setup is using
Hadoop.
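For reference, local mode is just Hadoop's default configuration.
Roughly (property names from the Hadoop 0.x line that Nutch ships with;
exact defaults can vary by version):

    <!-- hadoop-site.xml: leaving these at their defaults keeps
         Hadoop on the local file system, with MapReduce jobs
         running in-process rather than on a cluster -->
    <property>
      <name>fs.default.name</name>
      <value>file:///</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>local</value>
    </property>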
HTH,
Renaud
Would it be recommended to use Hadoop for crawling (100 sites with 1000
pages each) on a single machine? What would be the benefit?
Something like what is described at
http://wiki.apache.org/nutch/NutchHadoopTutorial, but on a single
machine.
Or is the simple crawl/recrawl (without Hadoop, as described in the
Nutch tutorial on the wiki, http://wiki.apache.org/nutch/NutchTutorial,
plus the recrawl script from the wiki) the way to go?
Thanks,
Tomislav