On 10/26/06, AJ Chen <[EMAIL PROTECTED]> wrote:
Current version of nutch uses mapred regardless of the number of computer
nodes. So, for applications using a single computer and default
configuration (i.e. no distributed crawling), the issue is not about
performance gain from mapred, but rather how to minimize the overhead from
mapred. Does anyone have a good performance benchmark for running nutch 0.8 or
0.9 on a single machine? Particularly, how much time spent in map-reduce
phases is reasonable relative to the time used by fetching phase?

Can anyone tell whether 4 hours of doing "reduce > reduce" after fetching
100,000 pages in 5 hours is within expectations? If it's not right, what
might be the cause?

How much memory do you have? Also, what's the Java heap size? I have the
following configuration on my test server:

6 GB memory
one 64-bit AMD CPU
Java heap 4 GB
and I can do 250,000 pages, crawl to index, in about 2-3 hours.

Also, I am fetching strictly HTML pages: no PDF, no Word, etc.

Sorry, it's very difficult to say what could be the problem.
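
For a rough comparison of the numbers quoted in this thread, here is a small arithmetic sketch. It is not a benchmark: the function name is made up for illustration, 2.5 hours is an assumed midpoint for my "2-3 hours", and the two runs are not measured the same way (AJ's 9 hours cover fetch plus reduce only, mine cover crawl to index).

```python
def pages_per_hour(pages, hours):
    """Average throughput in pages per hour (hypothetical helper)."""
    return pages / hours

# AJ's run: 100,000 pages, 5 h fetching plus 4 h in "reduce > reduce".
aj_fetch_only = pages_per_hour(100_000, 5)
aj_fetch_plus_reduce = pages_per_hour(100_000, 5 + 4)

# Zaheed's run: 250,000 pages, crawl to index, in about 2-3 h.
# 2.5 h is an assumed midpoint, not a measured value.
zaheed_crawl_to_index = pages_per_hour(250_000, 2.5)

# Share of AJ's wall-clock time spent in the reduce phase.
reduce_share = 4 / (5 + 4)

print(f"AJ fetch only:         {aj_fetch_only:,.0f} pages/h")
print(f"AJ fetch + reduce:     {aj_fetch_plus_reduce:,.0f} pages/h")
print(f"Zaheed crawl to index: {zaheed_crawl_to_index:,.0f} pages/h")
print(f"AJ reduce share:       {reduce_share:.0%}")
```

On these figures the reduce phase takes a bit under half of AJ's total run time.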

Regards,
Zaheed


AJ


On 10/26/06, Josef Novak <[EMAIL PROTECTED]> wrote:
>
> Hi AJ,
>
> I very well may be wrong, but as I understand it, nutch/hadoop
> implements map/reduce primarily as a means of efficiently and reliably
> distributing work among nodes in a (large) cluster of consumer grade
> machines.  I suspect that there is not much to be gained from
> implementing it with a single machine.
>
> http://labs.google.com/papers/mapreduce.html
> http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
> http://wiki.apache.org/lucene-hadoop/
>
>
> happy hunting,
> joe
>
>
> On 10/27/06, AJ Chen <[EMAIL PROTECTED]> wrote:
> > I'm using 0.9-dev code to crawl the web on a single machine. Using the
> > default configuration, it spends ~5 hours fetching 100,000 pages, but also
> > more than 5 hours doing map-reduce. Is this the expected performance for
> > the map-reduce phase relative to the fetch phase? It seems to me map-reduce
> > takes too much time. Is there anything to configure in order to reduce the
> > time spent in map-reduce? I'd appreciate any suggestion on how to improve
> > web search performance on a single machine.
> >
> > Thanks,
> >
> > AJ
> > http://web2express.org
> >
> >
>



--
AJ Chen, PhD
http://web2express.org

