Prabhu,

For nutch .7x the upper limit on the webdb isn't
necessarily a file-size limit but a hardware/computation
limit. You basically need about 210% of your webdb size to
do any processing of it, so if you have 100 million urls
and a 1.5 terabyte webdb you need (on the same server) 3.7
terabytes of disk space to process the webdb, do the
updates, drop the tmp file and update your main webdb.
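
To put that in a worked example (a rough sketch only; the
class name, the 2.1x factor and the extra tmp-file headroom
below are my own assumptions chosen to match the figures
above, not exact Nutch numbers):

    // Back-of-the-envelope estimate of the free disk needed to
    // process a webdb in nutch .7x on a single server.
    public class WebDbDiskEstimate {
        public static void main(String[] args) {
            double webDbTb = 1.5;          // current webdb size in terabytes
            double processingFactor = 2.1; // ~210% of the webdb size (assumed)
            double tmpHeadroomTb = 0.5;    // assumed slack for the tmp file

            double requiredTb = webDbTb * processingFactor + tmpHeadroomTb;
            System.out.printf("~%.1f TB free needed to update a %.1f TB webdb%n",
                    requiredTb, webDbTb);
        }
    }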

For the .8/mapred branch, the work is broken down into
smaller map jobs, so there isn't one huge file that
consumes everything. The disk space that was wasted before
can instead be distributed across more systems for
redundancy and performance.
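
As a conceptual illustration (this is not Nutch's actual
mapred code; the class below and the partition count are
made up), splitting the update into per-partition map tasks
keeps the temporary working space on any one machine
proportional to its slice of the db rather than to the
whole webdb:

    // Conceptual sketch: the same rough ~2.1x working-space rule,
    // but applied per partition instead of to the whole webdb.
    public class PartitionedUpdateSketch {
        static long tmpSpaceNeeded(long partitionBytes) {
            return (long) (partitionBytes * 2.1);
        }

        public static void main(String[] args) {
            long fullDbBytes = 1_500_000_000_000L; // 1.5 TB webdb
            int mapTasks = 50;                     // assumed number of map tasks
            long perPartition = fullDbBytes / mapTasks;

            System.out.println("tmp space per map task: "
                    + tmpSpaceNeeded(perPartition) / 1_000_000_000L + " GB, vs "
                    + tmpSpaceNeeded(fullDbBytes) / 1_000_000_000L
                    + " GB on one server");
        }
    }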


Hopefully that makes sense.

-byron
--- Raghavendra Prabhu <[EMAIL PROTECTED]> wrote:

> Hi Stefan
> 
> Thanks for your mail
> 
> What I would like to know is (since I am using
> nutch-0.7), what is the upper limit on the webdb size,
> if any such limit exists in nutch-0.7?
> 
> Will the generate step for a webdb formed from one TB
> of data (just an example) work?
> 
> And what is the difference between the webdb and
> nutch-0.8's crawldb and linkdb that makes the size
> effectively unlimited in nutch-0.8?
> 
> Rgds
> Prabhu
> 
> 
> 
> On 1/30/06, Stefan Groschupf <[EMAIL PROTECTED]>
> wrote:
> >
> > You can already use ndfs in 0.7; however, if the
> > webdb is too large it takes too much time to generate
> > segments. So the problem is the webdb size, not the
> > hdd limit.
> >
> > On 30.01.2006 at 07:31, Raghavendra Prabhu wrote:
> >
> > > Hi Stefan
> > >
> > > So can I assume that hard disk space is the only
> > > constraint in nutch-0.7?
> > >
> > > In nutch-0.8, since you can store it over ndfs, it
> > > is theoretically unlimited.
> > >
> > > Is my above-mentioned point true? (In a nutshell, I
> > > want to know whether the only constraint is the
> > > space for storing the nutch indexed data.)
> > >
> > > I will try to do some testing and, if possible,
> > > contribute to the wiki.
> > >
> > > Rgds
> > >
> > > Prabhu
> > >
> > >
> > >
> > > On 1/30/06, Stefan Groschupf <[EMAIL PROTECTED]>
> > > wrote:
> > >>
> > >> Any performance testing contribution to the wiki
> > >> is welcome, I guess. :)
> > >> There are no such values in the wiki except for
> > >> some statements regarding search speed.
> > >> With nutch .8 there is theoretically no size limit
> > >> any more.
> > >> Stefan
> > >> On 29.01.2006 at 13:35, Raghavendra Prabhu wrote:
> > >>
> > >>> Is there any benchmark on how nutch performs?
> > >>>
> > >>> I mean, say 1 GB of data is given as the input;
> > >>> how much time will it take to index this data on
> > >>> a 10 Mb/s network?
> > >>>
> > >>> And while doing a crawl, what is the volume of
> > >>> data that can be loaded (in terms of search, how
> > >>> much can a crawl segment hold)?
> > >>>
> > >>> Is there any performance limit to it? Is the only
> > >>> criterion the space to store the indexed data? Is
> > >>> there any limit to it?
> > >>>
> > >>> Rgds
> > >>>
> > >>> Prabhu
> > >>
> > >>
> > >> ---------------------------------------------------------------
> > >> company:        http://www.media-style.com
> > >> forum:        http://www.text-mining.org
> > >> blog:            http://www.find23.net
> > >>
> > >>
> > >>
> > >>
> >
> >
> 


