Hi Steve,
Steve Kallestad wrote:
> I've discovered that nutch follows links that aren't necessarily links -
> in my MediaWiki implementation, there is some out-of-the-box
> javascript that contains:
> var wgArticlePath = "/wiki/$1";
> Nutch actually tries to go to /wiki/$1. I've eliminated this
> particular problem by adding -[$] to my url-crawlfilters.txt file,
You might also want to disable the parse-js plugin if you don't need it,
by removing it from the plugin.includes property in your nutch-site.xml.
> but
> I can't imagine that this is the only time this kind of problem will
> pop up. I'm wondering if there isn't a way to ensure that all links
> start with one of:
> href="
> href = "
> href= "
> href ="
> I'm a little shy about trying to implement such a filter without any
> advice. Does anyone have any thoughts on how to build such a filter
> into nutch?
This is already taken care of by DOMContentUtils.getOutlinks: it extracts
links from DOM attributes rather than by matching strings in the page source.
> Right now, I'm just doing site-search which means this isn't that big
> a problem. But I'm concerned about implementing a wider ranging
> search index without having a resolution to this problem - I'd hate
> for my spider to be grabbing a bunch of unlinked 404's.
> Also - does nutch follow rel="nofollow" links out of the box?
> I imagine that it respects robots.txt, but I thought I'd ask about
> that one too, just to be safe - I'm a newbie after all :)
No: DOMContentUtils.getOutlinks ignores rel="nofollow" links.
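
To illustrate the idea of reading links from DOM attributes instead of
matching strings, here is a minimal sketch. It is not Nutch's actual code:
the class AnchorLinkSketch and the method extractLinks are hypothetical
names, and it assumes well-formed XHTML so that the JDK's own XML parser
can be used. It collects href values only from <a> elements and skips
anchors marked rel="nofollow", so the path inside the script text is
never treated as a link, and the spacing around "=" does not matter.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not the Nutch implementation.
class AnchorLinkSketch {

    // Returns the href of every <a> element that is not rel="nofollow".
    // Assumes the input is well-formed XHTML.
    static List<String> extractLinks(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

            List<String> links = new ArrayList<>();
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                if (!a.hasAttribute("href")) {
                    continue; // anchor without a target
                }
                if ("nofollow".equalsIgnoreCase(a.getAttribute("rel"))) {
                    continue; // honor rel="nofollow"
                }
                links.add(a.getAttribute("href"));
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String page =
            "<html><head>"
            + "<script>var wgArticlePath = \"/wiki/$1\";</script>"
            + "</head><body>"
            + "<a href=\"/wiki/Main_Page\">Main</a>"
            + "<a href = \"/wiki/Help\">Help</a>"
            + "<a rel=\"nofollow\" href=\"/wiki/Edit\">Edit</a>"
            + "</body></html>";
        // The script text and the nofollow anchor are both left out.
        System.out.println(extractLinks(page)); // prints [/wiki/Main_Page, /wiki/Help]
    }
}
```

Because the script body is just text inside a <script> element, a
DOM-based extractor never sees "/wiki/$1" as a link; only a parser that
scans JavaScript source for URL-like strings would pick it up.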
HTH,
Renaud
--
Renaud Richardet +1 617 230 9112
my email is my first name at apache.org http://www.oslutions.com