Hi, thanks. I got the explanation. So with mapreduce we will be able to process the crawldb efficiently.
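
Just to check my own understanding of why splitting the work helps, here is a toy sketch of the partitioning idea. This is not Nutch's actual code, only the general map-style splitting pattern; the class name, example urls and partition count are made up:

// Partition.java -- a toy illustration (NOT Nutch code) of the mapred idea:
// instead of rewriting one huge webdb file, records are hash-partitioned
// into smaller pieces that can be processed independently, possibly on
// different machines.
import java.util.ArrayList;
import java.util.List;

public class Partition {
    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a", "http://example.org/b",
            "http://example.net/c", "http://example.com/d"
        };
        int numPartitions = 3; // think "number of map tasks" (made-up number)

        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String url : urls) {
            // stable hash -> partition index, roughly like a MapReduce partitioner
            int p = (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
            parts.get(p).add(url);
        }
        for (int i = 0; i < numPartitions; i++) {
            System.out.println("partition " + i + ": " + parts.get(i));
        }
    }
}

Each partition stays small, so no single step ever needs room for the whole database at once.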
Rgds
Prabhu

On 1/31/06, Byron Miller <[EMAIL PROTECTED]> wrote:
>
> Prabhu,
>
> For nutch .7x the upper limit of the webdb isn't necessarily file size
> but hardware/computation size. You basically need about 210% of your
> webdb size to do any processing of it, so if you have 100 million urls
> and a 1.5 terabyte webdb you need (on the same server) 3.7 terabytes of
> disk space to process the webdb, do the updates, drop the tmp file and
> update your main webdb.
>
> For the .8/mapred branch, the work is broken down into smaller map jobs
> and there is no single huge file that consumes everything. The space
> that was wasted before can instead be spread across more systems for
> redundancy and performance.
>
> Hopefully that makes sense.
>
> -byron
>
> --- Raghavendra Prabhu <[EMAIL PROTECTED]> wrote:
>
> > Hi Stefan
> >
> > Thanks for your mail
> >
> > What I would like to know is (since I am using nutch-0.7): what is
> > the upper limit on the webdb size, if any such limit exists in
> > nutch-0.7?
> >
> > Will the generate step work for a webdb built from, say, one TB of
> > data?
> >
> > And what is the difference between the webdb and nutch-0.8's crawldb
> > and linkdb that makes it practically unlimited in nutch-0.8?
> >
> > Rgds
> > Prabhu
> >
> > On 1/30/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> > >
> > > You can already use ndfs in 0.7; however, if the webdb is too
> > > large it takes too much time to generate segments.
> > > So the problem is the webdb size, not the hdd limit.
> > >
> > > On 30.01.2006 at 07:31, Raghavendra Prabhu wrote:
> > >
> > > > Hi Stefan
> > > >
> > > > So can I assume that hard disk space is the only constraint in
> > > > nutch-0.7?
> > > >
> > > > In nutch-0.8, since you can store it over ndfs, it is
> > > > theoretically unlimited.
> > > >
> > > > Is my above-mentioned point true? (In a nutshell, I want to know
> > > > whether the only constraint is the space for storing the nutch
> > > > indexed data.)
> > > >
> > > > I will try to do some testing and, if possible, contribute to
> > > > the wiki.
> > > >
> > > > Rgds
> > > >
> > > > Prabhu
> > > >
> > > > On 1/30/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > Any performance testing contribution to the wiki is welcome, I
> > > > > guess. :)
> > > > > There are no such values in the wiki except for some
> > > > > statements regarding search speed.
> > > > > With nutch .8 there is theoretically no size limit any more.
> > > > > Stefan
> > > > >
> > > > > On 29.01.2006 at 13:35, Raghavendra Prabhu wrote:
> > > > >
> > > > > > Is there any benchmark on how nutch performs?
> > > > > >
> > > > > > I mean, say 1 GB of data is given as the input: how much
> > > > > > time will it take to index this data on a 10 Mb/s network?
> > > > > >
> > > > > > And while doing a crawl, what is the volume of data which
> > > > > > can be loaded? (This is in terms of search: how much can a
> > > > > > crawl segment hold?)
> > > > > >
> > > > > > Is there any performance limit to it? Is the only criterion
> > > > > > the space to store the indexed data? Is there any limit to
> > > > > > it?
> > > > > >
> > > > > > Rgds
> > > > > >
> > > > > > Prabhu
> > > > >
> > > > > ---------------------------------------------------------------
> > > > > company: http://www.media-style.com
> > > > > forum: http://www.text-mining.org
> > > > > blog: http://www.find23.net
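
P.S. To make the 210% figure above concrete for myself, a rough back-of-the-envelope sketch. The 2.1 factor is the rule of thumb from Byron's mail; the class name and the example webdb size are made up, and this is an estimate only, not anything from Nutch:

// DiskHeadroom.java -- back-of-the-envelope estimate only, not Nutch code.
// Assumes the ~210% rule of thumb from the mail above.
public class DiskHeadroom {
    public static void main(String[] args) {
        double webdbTb = 1.0;  // example webdb size in TB (made-up number)
        double factor = 2.1;   // ~210% of the webdb size, per the rule of thumb
        double neededTb = webdbTb * factor;
        System.out.printf("A %.1f TB webdb needs roughly %.1f TB free on the same box%n",
                webdbTb, neededTb);
    }
}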
