hi Tomislav,
Hi Renaud,
thank you for your reply. This is valuable information, but can you
elaborate a little more on the following:

you say: Nutch is "always" using Hadoop.

I assume it does not use the Hadoop Distributed File System (HDFS) when
running on a single machine by default?

The Hadoop homepage says: Hadoop implements MapReduce, using HDFS.

If there is no distributed file system over compute nodes (single-machine
configuration), what does Hadoop do?
Well, you're not using the full potential of Hadoop's HDFS when running Nutch on a single machine (still, Hadoop handles the map-reduce logic, the configuration objects, etc.). It's like using a chainsaw to cut a toothpick ;-) Nevertheless, Nutch is a very good choice for single-machine deployments: high-performance, reliable, and easy to customize.
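To make that a bit more concrete, here is roughly what single-machine ("local") mode looks like in practice. The property names and default values below are from memory of the Hadoop version bundled with Nutch at the time, so treat this as a sketch and double-check against the conf/hadoop-default.xml in your release:

  # Local mode: no Hadoop daemons to start, no filesystem to format.
  # With conf/hadoop-site.xml left at its defaults, the relevant settings
  # boil down to roughly:
  #   fs.default.name    = local   (plain local filesystem instead of HDFS)
  #   mapred.job.tracker = local   (map-reduce jobs run in-process)
  # so the one-shot crawl from the NutchTutorial works as-is:
  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000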
When running the crawl/recrawl cycle (generate/fetch/update), what
processes is Hadoop running?
Have a look at the class Crawl.java: it chains the generate/fetch/update steps, and each of them runs as a Hadoop map-reduce job.
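If you prefer to see it from the command line, the loop in Crawl.java corresponds more or less to the step-by-step commands from the NutchTutorial below (the urls, crawl/crawldb and crawl/segments paths are just the tutorial's layout, and -topN 1000 is an example value):

  bin/nutch inject crawl/crawldb urls                  # seed the crawldb with your URL list
  # one generate/fetch/update round; Crawl.java repeats this "depth" times
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/2* | tail -1`          # pick the newly generated segment
  bin/nutch fetch $segment                             # fetch (and parse) the pages in it
  bin/nutch updatedb crawl/crawldb $segment            # merge the results back into the crawldb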
How can I monitor them to see what is going on (e.g. how many URLs are
fetched and how many are still unfetched from the fetchlist)? Is there a
GUI for this?
No GUI, but the command-line tools can give you information (e.g. readdb, http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20readdb).
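For the fetched/unfetched counts you mention, something like this is usually enough (paths assume the tutorial's crawl/crawldb layout):

  bin/nutch readdb crawl/crawldb -stats                          # totals per status: fetched, unfetched, ...
  bin/nutch readdb crawl/crawldb -url http://www.example.com/    # status of a single URL (use one of yours)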
you say: Fetching 100 sites of 1000 pages each with a single machine should
definitely be OK

What about recrawling on a regular basis (once a day or even more often)?
It depends on your configuration and connection, but you can expect to fetch 10-30 pages per second, so for 100K pages it will take < 3 h. Re disk space: with an estimate of 10 KB per page for the index, it will take you ~1 GB of disk space.
See more on http://wiki.apache.org/nutch/HardwareRequirements
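Spelled out, the back-of-the-envelope calculation is just:

  pages=$((100 * 1000))                                   # 100 sites x 1000 pages = 100000
  echo "fetch time at 10 pages/s: $((pages / 10)) s"      # 10000 s, i.e. ~2.8 h, so < 3 h
  echo "disk at ~10 KB/page: $((pages * 10 / 1000)) MB"   # ~1000 MB, i.e. ~1 GB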

HTH,
Renaud
Sorry if these are basic questions, but I am trying to learn about Nutch
and Hadoop.
Thanks,
     Tomislav


On Thu, 2007-08-30 at 18:06 -0400, [EMAIL PROTECTED] wrote:
hi Tomislav,

The Nutch Tutorial is the way to go. Fetching 100 sites of 1000 pages each with a single machine should definitely be OK. You might want to add more machines if many, many people are searching your index.

BTW, Nutch is "always" using Hadoop. When testing locally or when using only one machine, Hadoop just uses the local file system. So even the NutchTutorial uses Hadoop.

HTH,
Renaud

Would it be recommended to use Hadoop for crawling (100 sites with 1000
pages each) on a single machine? What would be the benefit?
Something like what is described at
http://wiki.apache.org/nutch/NutchHadoopTutorial but on a single
machine.


Or is the simple crawl/recrawl (without Hadoop, as described in the Nutch
tutorial on the wiki, http://wiki.apache.org/nutch/NutchTutorial, plus the
recrawl script from the wiki) the way to go?

Thanks,
       Tomislav




