Hi,
That's a great idea. I would really like to know which settings you
tinkered with.
thanks,
Sebastian Steinmetz
On 07.04.2008 at 18:52, Bradford Stephens wrote:
Greetings again,
Just wanted to let you know that I did increase the threads to 400 per
server, and 3 per host. I was seeing about 15 pages/second. I didn't
get a chance to implement the other suggestions because I'd eat all
of the office's bandwidth and get yelled at :)
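A minimal sketch of how these thread settings might look in
conf/nutch-site.xml. It assumes that "400 per server" corresponds to
the fetcher.threads.fetch property and "3 per host" to
fetcher.threads.per.host, as named in Nutch's nutch-default.xml; that
mapping is an assumption and is not confirmed anywhere in this thread.

```xml
<!-- conf/nutch-site.xml: local overrides for nutch-default.xml -->
<configuration>
  <!-- Assumed to be the "threads per server" value mentioned above -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>400</value>
  </property>
  <!-- Assumed to be the "per host" value mentioned above -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>3</value>
  </property>
</configuration>
```

Raising the per-host value increases the load on each crawled site, so
it is worth keeping it low when crawling servers you do not control.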
Maybe I'll make a "Nutch Speed Improvements" entry in the Wiki.
Cheers,
Bradford Stephens
On Sun, Apr 6, 2008 at 10:06 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
Regarding the Tika error message, I've seen that, too... if you
need motivation, Chris. :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Chris Mattmann <[EMAIL PROTECTED]>
To: [email protected]
Sent: Saturday, April 5, 2008 2:58:33 AM
Subject: Re: Slow Crawl Speed and Tika Error Media type alias
already exists: text/xml
Hi Bradford,
I'm running Nutch 0.9 and Hadoop on 5 new, fast servers connected to a
multiple T-3 line. Although it works fine, the fetch portion of the
crawls seems to be awfully slow. The status message at one point is
"157 pages, 1 errors, 1.7 pages/s, 487 kb/s". Less than one page a
second seems to be awfully slow, given the environment I'm in. Is it a
configuration issue? I'm using 200 threads per fetcher. I've also
tried only 10 threads :)
There are other parameters that control the speed of the fetch. What is
your value for speculative execution? I remember seeing something on the
list saying that this parameter should be turned off to optimize fetch
speed. Give that a try, and let me know how it works out.
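As a rough sketch, turning speculative execution off might look like
the following override in conf/hadoop-site.xml. It assumes the single
mapred.speculative.execution property used by older Hadoop releases;
later releases split this into separate map and reduce settings, so
check the hadoop-default.xml shipped with your version before relying
on the name.

```xml
<!-- conf/hadoop-site.xml: local overrides for hadoop-default.xml -->
<configuration>
  <!-- Assumed property name; verify against your hadoop-default.xml -->
  <property>
    <name>mapred.speculative.execution</name>
    <value>false</value>
  </property>
</configuration>
```

The usual reasoning is that speculative execution starts duplicate
attempts of slow tasks, which for a fetch job can mean requesting the
same pages twice.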
I'm also seeing my hadoop.logs rapidly filled with the error message
mentioned in [NUTCH-618], which states:
2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader:
Invalid media type alias: text/xml
org.apache.tika.mime.MimeTypeException: Media type alias already
exists: text/xml
Is this impacting the performance? I've tried removing
conf/tika-mimetypes.xml on all my machines, but that doesn't seem to
resolve the error message.
Though definitely annoying, I am fairly sure it's not directly affecting
your performance, since the message is a simple WARNING that a detected
media type has been added multiple times to the mime types registry. I
certainly need to address this issue though, so thanks for giving me
some motivation.
Let me know what the results of the speculative execution adjustment
are. Also, it may help to mention (here on the list) any other
configuration adjustments you have made (or will make).
HTH,
Chris
Much thanks in advance :)
Cheers,
Bradford
______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory Pasadena, CA
Office: 171-266B Mailstop: 171-246
_______________________________________________________
Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.