Hi all,
I need some clarification on the relationship between cached content and
Nutch's ability to display a search summary. In particular, does Nutch
require the content to be cached in order to index it? The reason I ask: if
you run the query below, the following hit shows up. Note that there is NO
search summary for this site, and the cached content is not available either.
However, Nutch must have indexed the site's content at some earlier stage.
The question is when and how, if the content isn't available (i.e. was it
available at the time Nutch indexed the page?). Secondly, how can I prevent
Nutch from displaying such results? I need ALL search results to have a
summary, so if there is a problem with the cached content I don't want the
item returned.
Jamaica Hotels Motels - a travel and top tourism guide to Jamaica
http://www.jamaicahotelsmotels.com/ (cached) (explain) (anchors)
(more from www.jamaicahotelsmotels.com)
test URL:
http://67.192.122.70:8180/nutch-svn/search.jsp?query=Jamaica+Hotels+Motels+-+a+travel+and+top+tourism+guide+to+Jamaica&hitsPerPage=10&lang=en
In case someone points out that Nutch treats this site as a hit because the
query matches the page title: that may be true for this specific query, but
the query that originally exposed the problem was simply 'dominica', which is
NOT part of the page title or the page URL.
Thanks in advance,
Hilkiah G. Lavinier MEng (Hons), ACGI
6 Winston Lane,
Goodwill,
Roseau, Dominica
Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487
Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201 / AOL hilkiah21
----- Original Message ----
From: Bradford Stephens <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, April 9, 2008 7:29:07 PM
Subject: Re: Slow Crawl Speed and Tika Error Media type alias already exists:
text/xml
Thanks for keeping the help coming!
I had about 10 URLs/distinct hosts in my initial list. I *think* I'm
using Fetcher -- it's whatever comes with the 0.9 trunk that I checked
out. Is Fetcher2 faster?
I did not turn off parsing during fetching explicitly, I used whatever
setting was the default.
I did set 3 threads per host, but we're only running through a few T3s
here, so I didn't think I was overwhelming them. I'm not sure about the
delay; I think that was the default as well.
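
A rough sanity check on these numbers, as a sketch under stated assumptions
rather than a description of how Fetcher actually schedules requests: if a
polite fetcher sends at most one request per host per delay interval, then the
number of distinct hosts caps throughput regardless of how many fetcher threads
are running. The function name is hypothetical, and the 5-second delay is an
assumed value, not one confirmed from this checkout's configuration.

```python
def polite_fetch_ceiling(distinct_hosts: int, delay_seconds: float) -> float:
    """Upper bound on pages/second when each host is hit at most once per delay."""
    if distinct_hosts < 0:
        raise ValueError("distinct_hosts must not be negative")
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return distinct_hosts / delay_seconds


hosts = 10           # "about 10 URLs/distinct hosts" from the message above
assumed_delay = 5.0  # ASSUMPTION: seconds between requests to the same host

ceiling = polite_fetch_ceiling(hosts, assumed_delay)
print(f"{ceiling:.1f} pages/s ceiling")  # prints "2.0 pages/s ceiling"
```

Under those assumptions the ceiling is 2 pages/s, in the same range as the
1.7 pages/s figure quoted further down this thread, which would point at
per-host politeness rather than bandwidth or thread count. The fetchlist may
have grown beyond the initial 10 hosts, so treat this as an estimate only.
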
On Tue, Apr 8, 2008 at 11:11 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> Brad, "Nutch Speed Improvements" would be great.
>
> Regarding your changes - by setting "3 threads per host" things should go
> faster indeed, but aren't you being "impolite"?
>
> How many URLs and how many distinct hosts did you have in your fetchlist?
> Did you use Fetcher or Fetcher2?
> Did you turn off parsing during fetching?
> What was the setting for the delay between subsequent requests to the same
> server? (ah, probably doesn't matter if you let 3 threads hit the same server
> concurrently)
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Bradford Stephens <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, April 7, 2008 12:52:56 PM
> Subject: Re: Slow Crawl Speed and Tika Error Media type alias already
> exists: text/xml
>
> Greetings again,
>
> Just wanted to let you know that I did increase the threads to 400 per
> server, and 3 per host. I was seeing about 15 pages/second. I didn't
> get a chance to implement the other suggestions because I'll eat all
> of the office's bandwidth and get yelled at :)
>
> Maybe I'll make a "Nutch Speed Improvements" entry in the Wiki.
>
> Cheers,
> Bradford Stephens
>
> On Sun, Apr 6, 2008 at 10:06 PM, Otis Gospodnetic
> <[EMAIL PROTECTED]> wrote:
> > Regarding the Tika error message, I've seen that, too... if you need
> > motivation, Chris. :)
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > ----- Original Message ----
> > From: Chris Mattmann <[EMAIL PROTECTED]>
> > To: [email protected]
> > Sent: Saturday, April 5, 2008 2:58:33 AM
> > Subject: Re: Slow Crawl Speed and Tika Error Media type alias already
> > exists: text/xml
> >
> > Hi Bradford,
> >
> > > I'm running Nutch 0.9 and Hadoop on 5 new, fast servers connected to
> > > multiple T-3 lines. Although it works fine, the fetch portion of the
> > > crawls seems to be awfully slow. The status message at one point is
> > > "157 pages, 1 errors, 1.7 pages/s, 487 kb/s". Under two pages a
> > > second seems to be awfully slow, given the environment I'm in. Is it a
> > > configuration issue? I'm using 200 threads per fetcher. I've also
> > > tried only 10 threads :)
> >
> > There are other parameters that control the speed of the fetch. What is
> > your value for speculative execution? I remember seeing something on the
> > list saying that this parameter should be turned off to optimize fetch
> > speed. Give that a try, and let me know how it works out.
> >
> > > I'm also seeing my hadoop.logs rapidly filled with the error message
> > > mentioned in [NUTCH-618], which states:
> > >
> > > 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader:
> > > Invalid media type alias: text/xml
> > > org.apache.tika.mime.MimeTypeException: Media type alias already
> > > exists: text/xml
> > >
> > > Is this impacting the performance? I've tried removing
> > > conf/tika-mimetypes.xml on all my machines, but that doesn't seem to
> > > resolve the error message.
> >
> > Though definitely annoying, I am fairly sure it's not directly affecting
> > your performance, since the message is a simple WARNING that a detected
> > media type has been added multiple times to the mime types registry. I
> > certainly need to address this issue though, so thanks for giving me some
> > motivation.
> >
> > Let me know what the results of the speculative execution adjustment are.
> > Also, it may help to share (here on the list) any other configuration
> > adjustments you have made or plan to make.
> >
> > HTH,
> > Chris
> >
> > >
> > > Much thanks in advance :)
> > >
> > > Cheers,
> > > Bradford
>
>
>
>