Hi Steve,

We are all learning :-), and your question was not annoying at all. Keep asking!
Cheers,
Renaud

Steve Kallestad wrote:
> Thanks. I'll do a bit more research on the subject. Believe it or
> not, I've never heard of DOMContentUtils.getOutlinks, but I'm learning
> a bit more and more every day :).
>
> Step 1 - install it
> Step 2 - ask a bunch of annoying newbie questions on the mailing list
> Step 3 - RTFM
> Step 4 - fix my installation so that it works to my needs
> Step 5 - answer annoying newbie questions from other people and help
> spread the word.
>
> I'm still on step 2, but I'm getting there :)
>
> Thanks,
> Steve
> http://www.stevekallestad.com/
>
> On 2/8/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
>> Hi Steve,
>>
>> Steve Kallestad wrote:
>> > I've discovered that nutch follows links that aren't necessarily
>> > links -
>> >
>> > in my MediaWiki implementation, there is some out-of-the-box
>> > javascript that contains:
>> >
>> > var wgArticlePath = "/wiki/$1";
>> >
>> > Nutch actually tries to go to /wiki/$1. I've eliminated this
>> > particular problem by adding -[$] to my url-crawlfilters.txt file,
>> You also might want to disable the parse-js parser if you don't need it
>> (in your nutch-site.xml)...
>> > but
>> > I can't imagine that this is the only time this kind of problem will
>> > pop up. I'm wondering if there isn't a way to ensure that all links
>> > start with one of:
>> > href="
>> > href = "
>> > href="
>> > href ="
>> >
>> > I'm a little shy about trying to implement such a filter without any
>> > advice. Does anyone have any thoughts on how to build such a filter
>> > into nutch?
>> This is taken care of by DOMContentUtils.getOutlinks: it relies on DOM
>> attributes, not Strings
>> >
>> > Right now, I'm just doing site-search which means this isn't that big
>> > a problem. But I'm concerned about implementing a wider ranging
>> > search index without having a resolution to this problem - I'd hate
>> > for my spider to be grabbing a bunch of unlinked 404's.
>> >
>> > Also - does nutch follow rel="nofollow" links out of the box?
>> >
>> > I imagine that it respects robots.txt, but I thought I'd ask about
>> > that one too, just to be safe - I'm a newbie after all :)
>> DOMContentUtils.getOutlinks ignores rel="nofollow" links
>>
>> HTH,
>> Renaud
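
To illustrate the point made in the thread about relying on DOM attributes rather than Strings, here is a rough sketch in Python. It is not Nutch's code and does not reproduce DOMContentUtils.getOutlinks; the names OutlinkCollector, extract_outlinks and SAMPLE are made up for this example. It only shows the general idea: when links are read from the href attribute of parsed <a> elements, a path that sits inside a JavaScript string is never seen as a link, spacing around the "=" does not matter, and rel="nofollow" links can be skipped in the same place.

```python
from html.parser import HTMLParser


class OutlinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # Only anchor elements are considered; text inside <script> is
        # delivered to handle_data and never reaches this method.
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if href and "nofollow" not in rel_tokens:
            self.outlinks.append(href)


def extract_outlinks(html):
    """Return the href of every <a> tag that is not marked rel="nofollow"."""
    collector = OutlinkCollector()
    collector.feed(html)
    collector.close()
    return collector.outlinks


# Sample page modelled on the MediaWiki case from the thread (assumed markup).
SAMPLE = """
<html><head>
<script type="text/javascript">var wgArticlePath = "/wiki/$1";</script>
</head><body>
<a href="/wiki/Main_Page">Main page</a>
<a href = "/wiki/Help">Help</a>
<a rel="nofollow" href="/wiki/Special:Login">Log in</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_outlinks(SAMPLE))
```

With this approach, /wiki/$1 is not reported because it appears only as script text, and the login link is dropped because of its rel attribute. A plain string search for quoted paths would have picked up both.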
