Hi Bradford,

> I'm running Nutch 0.9 and Hadoop on 5 new, fast servers connected to a
> multiple T-3 line. Although it works fine, the fetch portion of the
> crawls seems to be awfully slow. The status message at one point is
> "157 pages, 1 errors, 1.7 pages/s, 487 kb/s". Less than one page a
> second seems to be awfully slow, given the environment I'm in. Is it a
> configuration issue? I'm using 200 threads per fetcher. I've also
> tried only 10 threads :)

There are other parameters that control the speed of the fetch. What is your
value for speculative execution? I remember seeing something on the list
saying that this parameter should be turned off to optimize fetch speed.
Give that a try, and let me know how it works out.
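
For reference, a minimal sketch of that override, assuming your Hadoop
version uses the property name mapred.speculative.execution (I believe newer
releases split it into separate map and reduce properties, so check the
hadoop-default.xml that ships with your install). It would go inside the
<configuration> element of conf/hadoop-site.xml on each node:

```xml
<!-- Assumption: the property name below matches your Hadoop version;
     verify it against hadoop-default.xml before relying on it. -->
<property>
  <name>mapred.speculative.execution</name>
  <value>false</value>
</property>
```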

> I'm also seeing my hadoop.logs rapidly filled with the error message
> mentioned in [NUTCH-618], which states:
> 
> 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader:
> Invalid media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already
> exists: text/xml
> 
> Is this impacting the performance? I've tried removing
> conf/tika-mimetypes.xml on all my machines, but that doesn't seem to
> resolve the error message.

Though definitely annoying, I am fairly sure it's not directly affecting your
performance, since the message is a simple WARNING that a detected media type
has been added multiple times to the mime types registry. I certainly
need to address this issue though, so thanks for giving me some motivation.

Let me know what the results of the speculative execution adjustment are.
Also, it may help to share (here on the list) any other configuration
adjustments you have made or plan to make.

HTH,
 Chris

> 
> Much thanks in advance :)
> 
> Cheers,
> Bradford

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

