Thanks. I'll do a bit more research on the subject. Believe it or not, I'd never heard of DOMContentUtils.getOutlinks, but I'm learning a bit more every day :).
Step 1 - install it
Step 2 - ask a bunch of annoying newbie questions on the mailing list
Step 3 - RTFM
Step 4 - fix my installation so that it works to my needs
Step 5 - answer annoying newbie questions from other people and help spread the word.

I'm still on step 2, but I'm getting there :)

Thanks,
Steve
http://www.stevekallestad.com/

On 2/8/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
Hi Steve,

Steve Kallestad wrote:
> I've discovered that nutch follows links that aren't necessarily links -
>
> in my MediaWiki implementation, there is some out-of-the-box
> javascript that contains:
>
> var wgArticlePath = "/wiki/$1";
>
> Nutch actually tries to go to /wiki/$1. I've eliminated this
> particular problem by adding -[$] to my url-crawlfilters.txt file,

You also might want to disable the parse-js parser if you don't need it (in your nutch-site.xml)...

> but
> I can't imagine that this is the only time this kind of problem will
> pop up. I'm wondering if there isn't a way to ensure that all links
> start with one of:
> href="
> href = "
> href="
> href ="
>
> I'm a little shy about trying to implement such a filter without any
> advice. Does anyone have any thoughts on how to build such a filter
> into nutch?

This is taken care of by DOMContentUtils.getOutlinks: it relies on DOM attributes, not Strings

>
> Right now, I'm just doing site-search which means this isn't that big
> a problem. But I'm concerned about implementing a wider ranging
> search index without having a resolution to this problem - I'd hate
> for my spider to be grabbing a bunch of unlinked 404's.
>
> Also - does nutch follow rel="nofollow" links out of the box?
>
> I imagine that it respects robots.txt, but I thought I'd ask about
> that one too, just to be safe - I'm a newbie after all :)

DOMContentUtils.getOutlinks ignores rel="nofollow" links

HTH,
Renaud

--
Renaud Richardet
+1 617 230 9112
my email is my first name at apache.org
http://www.oslutions.com
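
For anyone reading this thread later, here is a rough illustration of the difference between reading links from DOM attributes and scanning the page text for strings. It is only a sketch and not Nutch's DOMContentUtils code: the class name DomLinkSketch and the method extractLinks are made up for this example, and it uses the JDK's XML parser, so it only accepts well-formed markup (a real crawler needs a lenient HTML parser). It collects the href attribute of <a> elements and skips anchors marked rel="nofollow". The path inside the <script> element is never considered because it is plain text, not an attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Nutch.
public class DomLinkSketch {

    // Returns the href values of <a> elements, skipping rel="nofollow" anchors.
    // Assumes the input is well-formed XML/XHTML.
    static List<String> extractLinks(String xhtml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            String href = a.getAttribute("href"); // "" when the attribute is absent
            String rel = a.getAttribute("rel");
            if (href.isEmpty()) {
                continue; // anchor without a target
            }
            if (rel.toLowerCase(Locale.ROOT).contains("nofollow")) {
                continue; // the page asked crawlers not to follow this link
            }
            links.add(href);
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String page =
            "<html><head><script>var wgArticlePath = \"/wiki/$1\";</script></head>"
            + "<body><a href=\"/wiki/Main_Page\">Main</a>"
            + "<a rel=\"nofollow\" href=\"/wiki/Special:Edit\">Edit</a>"
            + "</body></html>";
        System.out.println(extractLinks(page)); // prints [/wiki/Main_Page]
    }
}
```

Only /wiki/Main_Page comes back: the script's "/wiki/$1" string is ignored because it is not an href attribute, and the edit link is dropped because of rel="nofollow". A JavaScript-aware parser such as the parse-js plugin mentioned above works on the script text instead, which is where the /wiki/$1 outlink came from.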
