I've discovered that Nutch follows strings that aren't necessarily links.

In my MediaWiki installation, there is some out-of-the-box
JavaScript that contains:

var wgArticlePath = "/wiki/$1";

Nutch actually tries to fetch /wiki/$1.  I've eliminated this
particular problem by adding -[$] to my crawl-urlfilter.txt file, but
I can't imagine that this is the only time this kind of problem will
pop up.  Is there a way to ensure that Nutch only follows links that
start with one of:
href="
href = "
href= "
href ="

I'm a little shy about trying to implement such a filter without any
advice.  Does anyone have any thoughts on how to build such a filter
into Nutch?
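
To make the idea concrete, here is a rough sketch of the matching I have
in mind. It is plain Python, not Nutch's Java parser API, and the names
HREF_RE and extract_href_links are made up for illustration. It only
accepts double-quoted href values, with optional whitespace around the
equals sign:

```python
import re

# Matches "href", optional whitespace around "=", then a double-quoted value.
# Covers all four spellings: href="  href = "  href= "  href ="
# (HREF_RE and extract_href_links are hypothetical names for this sketch.)
HREF_RE = re.compile(r'\bhref\s*=\s*"([^"]*)"', re.IGNORECASE)


def extract_href_links(html):
    """Return the values of double-quoted href attributes found in html."""
    return HREF_RE.findall(html)


page = '''
<script>var wgArticlePath = "/wiki/$1";</script>
<a href="/wiki/Main_Page">Main</a>
<a href = "/wiki/Help">Help</a>
'''

# The JavaScript string is ignored because it is not an href attribute.
print(extract_href_links(page))  # ['/wiki/Main_Page', '/wiki/Help']
```

This is only a text-level match, so it would miss single-quoted or
unquoted href values and would still pick up an href that appears inside
a script or comment. A real fix probably belongs in the HTML parser.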

Right now, I'm just doing site search, so this isn't that big
a problem.  But I'm concerned about building a wider-ranging
search index without a resolution to this problem.  I'd hate
for my spider to be fetching a bunch of unlinked 404s.
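
For reference, this is how I understand the -[$] rule I added to behave.
The sketch below is a small Python imitation, not Nutch code. It assumes
each rule line is a + or - followed by a regular expression, that the
first rule found anywhere in the URL decides, and that a URL matching no
rule is dropped. The catch-all "+." rule, the example.org URLs, and the
function names are my own placeholders:

```python
import re


def parse_rules(text):
    """Parse filter lines of the form '+regex' or '-regex'.

    Blank lines and lines starting with '#' are skipped. Any line that
    does not start with '+' is treated as a reject rule in this sketch.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        rules.append((line[0] == '+', re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in url."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: no matching rule means the URL is skipped


RULES = parse_rules("""
# skip URLs containing a literal dollar sign, e.g. /wiki/$1
-[$]
# accept everything else (hypothetical catch-all for this sketch)
+.
""")

print(accepts(RULES, "http://example.org/wiki/$1"))         # False
print(accepts(RULES, "http://example.org/wiki/Main_Page"))  # True
```

A rule like this only catches placeholders I already know about, which
is why I'd rather restrict what counts as a link in the first place.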

Also, does Nutch follow rel="nofollow" links out of the box?

I imagine that it respects robots.txt, but I thought I'd ask about
that one too, just to be safe - I'm a newbie after all :)

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
