hi Tomislav,
Hi Renaud,
thank you for your reply. This is valuable information, but can you
elaborate a little bit more, like:
you say: Nutch is "always" using Hadoop.
I assume it does not use the Hadoop Distributed File System (HDFS) when
running on a single machine by default?
The Hadoop homepage says: Hadoop implements MapReduce, using the HDFS.
If there is no distributed file system over the computer nodes (single-
machine configuration), what does Hadoop do?
Well, you're not using the full potential of Hadoop's HDFS when using
Nutch on a single machine (still, Hadoop is handling the map-reduce
logic, the configuration objects, etc.). It's like using a chainsaw to
cut a toothpick ;-) Nevertheless, Nutch is a very good choice for
single-machine deployments: high-performance, reliable and easy to
customize.
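Concretely, "local" Hadoop just means the default configuration: both the
file system and the map-reduce jobs run in-process on the local machine,
with no HDFS daemons and no JobTracker involved. As a rough sketch (the
exact property names and default values depend on your Hadoop version,
and an empty conf/hadoop-site.xml gives you the same local defaults
anyway), it corresponds to something like:

  <!-- conf/hadoop-site.xml: single-machine setup, no HDFS, no JobTracker -->
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>local</value>          <!-- local file system instead of HDFS -->
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>local</value>          <!-- run map-reduce in the local JVM -->
    </property>
  </configuration>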
When running the crawl/recrawl cycle (generate/fetch/update),
what processes is Hadoop running?
Have a look at the class Crawl.java.
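If it helps, you can also run the cycle that Crawl.java drives step by
step from the command line. A rough sketch with 0.8-style commands (the
directory names, the -topN value and the number of rounds are just
example values):

  # inject the seed URLs into a fresh crawldb
  bin/nutch inject crawl/crawldb urls
  # one generate/fetch/update round; repeat once per depth level
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
  # once the rounds are done, build the link database and the index
  bin/nutch invertlinks crawl/linkdb crawl/segments/*
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Each of these steps is a Hadoop map-reduce job (run locally in your
case), so these are the "processes" you see while a crawl is going on.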
How can I monitor them to see what is
going on (like how many URLs are fetched and how many are still
unfetched from the fetchlist)? Is there a GUI for this?
No GUI, but the command line tools can give you information (e.g.
readdb http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20readdb).
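For the fetched/unfetched counts you mention, the -stats option is the
one to look at. For example (assuming your crawldb lives under
crawl/crawldb):

  # print per-status totals for the whole crawldb
  bin/nutch readdb crawl/crawldb -stats

It prints the total number of URLs plus a count per status (db_fetched,
db_unfetched, db_gone, ...), and readdb with -dump or -url gives you the
details of individual entries.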
you say: Fetching 100 sites of 1000 nodes with a single machine should
definitely be OK
What about recrawling on a regular basis (once a day or even more often)?
It depends on your configuration and connection, but you can expect to
fetch 10-30 pages per second. 100 sites x 1000 pages is 100K pages, so
even at the lower bound of 10 pages/second a full recrawl takes under 3
hours. Regarding disk space, at an estimate of 10 KB per page for the
index, those 100K pages will take roughly 1 GB.
See more on http://wiki.apache.org/nutch/HardwareRequirements
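If you want to recrawl on a schedule, the usual approach is to wrap the
generate/fetch/updatedb steps (plus the index rebuild) in a script like
the recrawl script from the wiki and run it from cron. Just as an
illustration (the paths and the recrawl.sh name are made up for the
example):

  # /etc/cron.d/nutch-recrawl -- run the recrawl script every night at 02:00
  0 2 * * *  nutch  /opt/nutch/bin/recrawl.sh /opt/nutch/crawl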
HTH,
Renaud
Sorry if these are basic questions, but I am trying to learn about Nutch
and Hadoop.
Thanks,
Tomislav
On Thu, 2007-08-30 at 18:06 -0400, [EMAIL PROTECTED] wrote:
hi Tomislav,
The Nutch Tutorial is the way to go. Fetching 100 sites of 1000 nodes
with a single machine should definitely be OK. You might want to add
more machines if a lot of people are searching your index.
BTW, Nutch is "always" using Hadoop. When testing locally or when using
only one machine, Hadoop just uses the local file system. So even the
NutchTutorial uses Hadoop.
HTH,
Renaud
Would it be recommended to use Hadoop for crawling (100 sites with 1000
pages each) on a single machine? What would be the benefit?
Something like what is described at
http://wiki.apache.org/nutch/NutchHadoopTutorial, but on a single
machine.
Or is the simple crawl/recrawl (without Hadoop, as described in the
Nutch tutorial on the wiki, http://wiki.apache.org/nutch/NutchTutorial,
plus the recrawl script from the wiki) the way to go?
Thanks,
Tomislav