hi Tomislav,
Hi Renaud,
thank you for your reply. This is valuable information, but can you
elaborate a little bit more, like:
you say: Nutch is "always" using Hadoop.
I assume it does not use the Hadoop Distributed File System (HDFS) when
running on a single machine by default?
The Hadoop homepage says: Hadoop implements MapReduce, using the HDFS.
If there is no distributed file system over the computer nodes (single-
machine configuration), what does Hadoop do?
Well, you're not using the full potential of Hadoop's HDFS when using
Nutch on a single machine (still, Hadoop is handling the map-reduce
logic, the configuration objects, etc.). It's like using a chainsaw to
cut a toothpick ;-) Nevertheless, Nutch is a very good choice for
single-machine deployments: high-performance, reliable and easy to
customize.
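Concretely, "local" Hadoop just means the default configuration: both the
file system and the map-reduce jobs run in-process on the local machine,
with no HDFS daemons and no JobTracker involved. As a rough sketch (the
exact property names and default values depend on your Hadoop version,
and an empty conf/hadoop-site.xml gives you the same local defaults
anyway), it corresponds to something like:

  <!-- conf/hadoop-site.xml: single-machine setup, no HDFS, no JobTracker -->
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>local</value>          <!-- local file system instead of HDFS -->
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>local</value>          <!-- run map-reduce in the local JVM -->
    </property>
  </configuration>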
When running the crawl/recrawl cycle (generate/fetch/update),
what processes is Hadoop running?
Have a look at the class Crawl.java.
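If it helps, you can also run the cycle that Crawl.java drives step by
step from the command line. A rough sketch with 0.8-style commands (the
directory names, the -topN value and the number of rounds are just
example values):

  # inject the seed URLs into a fresh crawldb
  bin/nutch inject crawl/crawldb urls
  # one generate/fetch/update round; repeat once per depth level
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
  # once the rounds are done, build the link database and the index
  bin/nutch invertlinks crawl/linkdb crawl/segments/*
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Each of these steps is a Hadoop map-reduce job (run locally in your
case), so these are the "processes" you see while a crawl is going on.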
How can I monitor them to see what is
going on (like how many URLs are fetched and how many are still
unfetched from the fetchlist)? Is there a GUI for this?
No GUI, but the command line tools can give you information (e.g.
readdb http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20readdb).
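For the fetched/unfetched counts you mention, the -stats option is the
one to look at. For example (assuming your crawldb lives under
crawl/crawldb):

  # print per-status totals for the whole crawldb
  bin/nutch readdb crawl/crawldb -stats

It prints the total number of URLs plus a count per status (db_fetched,
db_unfetched, db_gone, ...), and readdb with -dump or -url gives you the
details of individual entries.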
you say: Fetching 100 sites of 1000 nodes with a single machine should
definitely be OK
What about recrawling on a regular basis (once a day or even more often)?
It depends on your configuration and connection, but you can expect to
fetch 10-30 pages per second. 100 sites x 1000 pages is 100K pages, so
even at the lower bound of 10 pages/second a full recrawl takes under 3
hours. Regarding disk space, at an estimate of 10 KB per page for the
index, those 100K pages will take roughly 1 GB.
See more on http://wiki.apache.org/nutch/HardwareRequirements
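If you want to recrawl on a schedule, the usual approach is to wrap the
generate/fetch/updatedb steps (plus the index rebuild) in a script like
the recrawl script from the wiki and run it from cron. Just as an
illustration (the paths and the recrawl.sh name are made up for the
example):

  # /etc/cron.d/nutch-recrawl -- run the recrawl script every night at 02:00
  0 2 * * *  nutch  /opt/nutch/bin/recrawl.sh /opt/nutch/crawl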
HTH,
Renaud
Sorry if these are basic questions, but I am trying to learn about Nutch
and Hadoop.
Thanks,
Tomislav
On Thu, 2007-08-30 at 18:06 -0400, [EMAIL PROTECTED] wrote:
hi Tomislav,
The Nutch Tutorial is the way to go. Fetching 100 sites of 1000 nodes
with a single machine should definitely be OK. You might want to add
more machines if a lot of people are searching your index.
BTW, Nutch is "always" using Hadoop. When testing locally or when using
only one machine, Hadoop just uses the local file system. So even the
NutchTutorial uses Hadoop.
HTH,
Renaud
Would it be recommended to use Hadoop for crawling (100 sites with 1000
pages each) on a single machine? What would be the benefit?
Something like what is described at
http://wiki.apache.org/nutch/NutchHadoopTutorial, but on a single
machine.
Or is the simple crawl/recrawl (without Hadoop, as described in the
Nutch tutorial on the wiki, http://wiki.apache.org/nutch/NutchTutorial,
plus the recrawl script from the wiki) the way to go?
Thanks,
Tomislav