I've discovered that Nutch follows links that aren't necessarily links. In my MediaWiki installation, some of the out-of-the-box JavaScript contains:

var wgArticlePath = "/wiki/$1";

Nutch actually tries to fetch /wiki/$1. I've eliminated this particular problem by adding -[$] to my url-crawlfilters.txt file, but I can't imagine that this is the only time this kind of problem will pop up.

I'm wondering whether there is a way to ensure that every link Nutch follows comes from an href attribute, i.e. starts with one of:

href="
href = "
href ="

I'm a little shy about trying to implement such a filter without any advice. Does anyone have any thoughts on how to build such a filter into Nutch?

Right now I'm only doing site search, so this isn't that big a problem. But I'm concerned about building a wider-ranging search index without a resolution to this problem; I'd hate for my spider to be fetching a bunch of unlinked 404s.

Also, does Nutch follow rel="nofollow" links out of the box? I imagine that it respects robots.txt, but I thought I'd ask about that one too, just to be safe. I'm a newbie, after all :)

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
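
P.S. To make the question concrete, here is a rough standalone sketch of the kind of check I have in mind: only strings that appear as the value of an href attribute are treated as links, so the wgArticlePath string inside the script is never picked up. This is plain java.util.regex, not Nutch's parser or plugin API; the class name HrefLinkExtractor and the method extractLinks are made up for illustration, and I'm assuming both single and double quotes should be accepted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects only href attribute values.
public class HrefLinkExtractor {

    // href, optional whitespace around '=', then a quoted value.
    // Group 1 is the opening quote, group 2 is the URL, \1 requires the same closing quote.
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<script>var wgArticlePath = \"/wiki/$1\";</script>"
                + "<a href=\"/wiki/Main_Page\">Main</a>"
                + "<a href = \"/wiki/Help\">Help</a>";
        // The JavaScript string is ignored; only the two href values are returned.
        System.out.println(extractLinks(html));
    }
}
```

A regex like this is only good enough to show the idea. It would still match href="..." text that happens to sit inside a script block or a comment, so a real filter would presumably need to work on the parsed document rather than on raw text.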
