Hi Renaud,
What would be a recommended hardware specification for a machine running
the search web application for 15K users per day against this index
(100K pages)? And what is a good practice for getting the index from the
crawl machine to the search machine (if crawling and searching run on
separate machines)?
Thanks,
Tomislav
2007/9/1, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:
>
> hi Tomislav,
> > Hi Renaud,
> > thank you for your reply. This is valuable information, but can you
> > elaborate a little bit more, like:
> >
> > you say: Nutch is "always" using Hadoop.
> >
> > I assume it does not use the Hadoop Distributed File System (HDFS)
> > when running on a single machine by default?
> >
> > The Hadoop homepage says: Hadoop implements MapReduce, using the HDFS.
> >
> > If there is no distributed file system over computer nodes
> > (single-machine configuration), what does Hadoop do?
> >
> Well, you're not using the full potential of Hadoop's HDFS when using
> Nutch on a single machine: Hadoop still handles the map-reduce logic,
> the configuration objects, etc., but it runs the jobs locally and reads
> and writes the local file system instead of HDFS. It's like using a
> chainsaw to cut a toothpick ;-) Nevertheless, Nutch is a very good
> choice for single-machine deployments: high-performance, reliable and
> easy to customize.
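> Concretely, in local mode there are no Hadoop daemons to start (no
> namenode, datanode or jobtracker); you just unpack Nutch and run the
> tools against your local disk. A minimal sketch (the paths and numbers
> below are placeholders, adjust them to your setup):
>
>   # one-shot crawl, entirely on the local file system
>   bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
>
> This works because, if I remember correctly, the Hadoop defaults
> (fs.default.name and mapred.job.tracker both set to "local") make it
> use the local file system and run the map-reduce jobs in the same JVM.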
> > When running the crawl/recrawl cycle (generate/fetch/update), what
> > processes is Hadoop running?
> Have a look at the class Crawl.java: it simply chains the individual
> Nutch tools (inject, then repeated generate/fetch/update rounds, then
> invert links and index), roughly the command sequence sketched below.
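> As a sketch, from memory, so double-check the exact arguments against
> the usage messages of bin/nutch (directory names are just examples):
>
>   # once: seed the crawldb with your start URLs
>   bin/nutch inject crawl/crawldb urls
>
>   # then, for each generate/fetch/update round:
>   bin/nutch generate crawl/crawldb crawl/segments -topN 1000
>   s=`ls -d crawl/segments/* | tail -1`   # the segment just created
>   bin/nutch fetch $s
>   bin/nutch updatedb crawl/crawldb $s
>
>   # finally, build the link database and the index
>   bin/nutch invertlinks crawl/linkdb -dir crawl/segments
>   bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
>
> Each of these commands runs as a Hadoop map-reduce job; on a single
> machine they simply execute locally, one after the other.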
> > How can I monitor them to see what is
> > going on (like how many URLs are fetched and how many are still
> > unfetched in the fetchlist)? Is there a GUI for this?
> >
> No GUI, but the command-line tools can give you that information (e.g.
> readdb, see http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20readdb)
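> For the fetched/unfetched counts you mention, the -stats option is the
> quickest way (again, the crawldb path is just an example):
>
>   bin/nutch readdb crawl/crawldb -stats
>
> It prints the total number of URLs in the crawldb broken down by status
> (unfetched, fetched, gone, etc.), which is usually enough to follow the
> progress of a crawl.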
> > you say: Fetching 100 sites of 1000 pages each with a single machine
> > should definitely be OK
> >
> > What about recrawl on a regular basis (once a day or even more often)?
> >
> It depends on your configuration and connection, but you can expect to
> fetch 10-30 pages per second, so a full fetch of 100K pages should take
> less than 3 hours. Regarding disk space, with an estimate of 10 KB per
> page for the index, you will need roughly 1 GB.
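> Back-of-the-envelope, assuming the low end of those figures:
>
>   100,000 pages / 10 pages per second = 10,000 s, i.e. just under 3 hours
>   100,000 pages * 10 KB per page = ~1,000,000 KB, i.e. about 1 GB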
> See more on http://wiki.apache.org/nutch/HardwareRequirements
>
> HTH,
> Renaud
> > Sorry if these are basic questions, but I am trying to learn about
> > Nutch and Hadoop.
> >
> > Thanks,
> > Tomislav
> >
> > On Thu, 2007-08-30 at 18:06 -0400, [EMAIL PROTECTED] wrote:
> >
> >> hi Tomislav,
> >>
> >> The Nutch Tutorial is the way to go. Fetching 100 sites of 1000 pages
> >> each with a single machine should definitely be OK. You might want to
> >> add more machines if a lot of people are searching your index.
> >>
> >> BTW, Nutch is "always" using Hadoop. When testing locally or when using
> >> only one machine, Hadoop just uses the local file system. So even the
> >> NutchTutorial uses Hadoop.
> >>
> >> HTH,
> >> Renaud
> >>
> >>
> >>> Would it be recommended to use Hadoop for crawling (100 sites with
> >>> 1000 pages each) on a single machine? What would be the benefit?
> >>> Something like what is described at
> >>> http://wiki.apache.org/nutch/NutchHadoopTutorial, but on a single
> >>> machine.
> >>>
> >>>
> >>> Or is the simple crawl/recrawl (without Hadoop, as described in the
> >>> Nutch tutorial on the wiki,
> >>> http://wiki.apache.org/nutch/NutchTutorial, plus the recrawl script
> >>> from the wiki) the way to go?
> >>>
> >>> Thanks,
> >>> Tomislav
> >>>