Hi, thanks. I got the explanation. So with mapreduce we will be able to process the crawldb efficiently.
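
Just to check my own understanding of why splitting the work helps, here is a toy sketch of the partitioning idea. This is not Nutch's actual code, only the general map-style splitting pattern; the class name, example urls and partition count are made up:

// Partition.java -- a toy illustration (NOT Nutch code) of the mapred idea:
// instead of rewriting one huge webdb file, records are hash-partitioned
// into smaller pieces that can be processed independently, possibly on
// different machines.
import java.util.ArrayList;
import java.util.List;

public class Partition {
    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a", "http://example.org/b",
            "http://example.net/c", "http://example.com/d"
        };
        int numPartitions = 3; // think "number of map tasks" (made-up number)

        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String url : urls) {
            // stable hash -> partition index, roughly like a MapReduce partitioner
            int p = (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
            parts.get(p).add(url);
        }
        for (int i = 0; i < numPartitions; i++) {
            System.out.println("partition " + i + ": " + parts.get(i));
        }
    }
}

Each partition stays small, so no single step ever needs room for the whole database at once.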
Rgds
Prabhu

On 1/31/06, Byron Miller <[EMAIL PROTECTED]> wrote:
>
> Prabhu,
>
> For nutch .7x the upper limit of the webdb isn't necessarily file size
> but hardware/computation size. You basically need about 210% of your
> webdb size to do any processing of it, so if you have 100 million urls
> and a 1.5 terabyte webdb you need (on the same server) 3.7 terabytes of
> disk space to process the webdb, do the updates, drop the tmp file and
> update your main webdb.
>
> For the .8/mapred branch, the work is broken down into smaller map jobs
> and there is no single huge file that consumes everything. The space
> that was wasted before can instead be spread across more systems for
> redundancy and performance.
>
> Hopefully that makes sense.
>
> -byron
>
> --- Raghavendra Prabhu <[EMAIL PROTECTED]> wrote:
>
> > Hi Stefan
> >
> > Thanks for your mail
> >
> > What I would like to know is (since I am using nutch-0.7): what is
> > the upper limit on the webdb size, if any such limit exists in
> > nutch-0.7?
> >
> > Will the generate step work for a webdb built from, say, one TB of
> > data?
> >
> > And what is the difference between the webdb and nutch-0.8's crawldb
> > and linkdb that makes it practically unlimited in nutch-0.8?
> >
> > Rgds
> > Prabhu
> >
> > On 1/30/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> > >
> > > You can already use ndfs in 0.7; however, if the webdb is too
> > > large it takes too much time to generate segments.
> > > So the problem is the webdb size, not the hdd limit.
> > >
> > > On 30.01.2006 at 07:31, Raghavendra Prabhu wrote:
> > >
> > > > Hi Stefan
> > > >
> > > > So can I assume that hard disk space is the only constraint in
> > > > nutch-0.7?
> > > >
> > > > In nutch-0.8, since you can store it over ndfs, it is
> > > > theoretically unlimited.
> > > >
> > > > Is my above-mentioned point true? (In a nutshell, I want to know
> > > > whether the only constraint is the space for storing the nutch
> > > > indexed data.)
> > > >
> > > > I will try to do some testing and, if possible, contribute to
> > > > the wiki.
> > > >
> > > > Rgds
> > > >
> > > > Prabhu
> > > >
> > > > On 1/30/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > Any performance testing contribution to the wiki is welcome, I
> > > > > guess. :)
> > > > > There are no such values in the wiki except for some
> > > > > statements regarding search speed.
> > > > > With nutch .8 there is theoretically no size limit any more.
> > > > > Stefan
> > > > >
> > > > > On 29.01.2006 at 13:35, Raghavendra Prabhu wrote:
> > > > >
> > > > > > Is there any benchmark on how nutch performs?
> > > > > >
> > > > > > I mean, say 1 GB of data is given as the input: how much
> > > > > > time will it take to index this data on a 10 Mb/s network?
> > > > > >
> > > > > > And while doing a crawl, what is the volume of data which
> > > > > > can be loaded? (This is in terms of search: how much can a
> > > > > > crawl segment hold?)
> > > > > >
> > > > > > Is there any performance limit to it? Is the only criterion
> > > > > > the space to store the indexed data? Is there any limit to
> > > > > > it?
> > > > > >
> > > > > > Rgds
> > > > > >
> > > > > > Prabhu
> > > > >
> > > > > ---------------------------------------------------------------
> > > > > company: http://www.media-style.com
> > > > > forum: http://www.text-mining.org
> > > > > blog: http://www.find23.net
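
P.S. To make the 210% figure above concrete for myself, a rough back-of-the-envelope sketch. The 2.1 factor is the rule of thumb from Byron's mail; the class name and the example webdb size are made up, and this is an estimate only, not anything from Nutch:

// DiskHeadroom.java -- back-of-the-envelope estimate only, not Nutch code.
// Assumes the ~210% rule of thumb from the mail above.
public class DiskHeadroom {
    public static void main(String[] args) {
        double webdbTb = 1.0;  // example webdb size in TB (made-up number)
        double factor = 2.1;   // ~210% of the webdb size, per the rule of thumb
        double neededTb = webdbTb * factor;
        System.out.printf("A %.1f TB webdb needs roughly %.1f TB free on the same box%n",
                webdbTb, neededTb);
    }
}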
