Hi Steve,
We are all learning :-), and your question was not annoying at all. Keep
asking!
Cheers,
Renaud
Steve Kallestad wrote:
Thanks. I'll do a bit more research on the subject. Believe it or
not, I've never heard of DOMContentUtils.getOutlinks, but I'm learning
a bit more every day :).
Step 1 - install it
Step 2 - ask a bunch of annoying newbie questions on the mailing list
Step 3 - RTFM
Step 4 - fix my installation so that it works for my needs
Step 5 - answer annoying newbie questions from other people and help
spread the word.
I'm still on step 2, but I'm getting there :)
Thanks,
Steve
http://www.stevekallestad.com/
On 2/8/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
Hi Steve,
Steve Kallestad wrote:
> I've discovered that nutch follows links that aren't necessarily links -
>
> in my MediaWiki implementation, there is some out-of-the-box
> javascript that contains:
>
> var wgArticlePath = "/wiki/$1";
>
> Nutch actually tries to go to /wiki/$1. I've eliminated this
> particular problem by adding -[$] to my crawl-urlfilter.txt file,
You also might want to disable the parse-js parser if you don't need it
(in your nutch-site.xml)...
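For illustration, here is a rough sketch of what that override might look like inside the <configuration> element of nutch-site.xml. The plugin list below is an assumption based on a typical default; copy the plugin.includes value from your own nutch-default.xml and only drop "js" from the parse-(...) group.

```xml
<!-- conf/nutch-site.xml: override plugin.includes so that parse-js is not loaded. -->
<!-- Assumption: the value below mirrors a typical default list; start from the -->
<!-- value in your own nutch-default.xml rather than copying this one verbatim. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
```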
> but
> I can't imagine that this is the only time this kind of problem will
> pop up. I'm wondering if there isn't a way to ensure that all links
> start with one of:
> href="
> href = "
> href= "
> href ="
>
> I'm a little shy about trying to implement such a filter without any
> advice. Does anyone have any thoughts on how to build such a filter
> into nutch?
This is taken care of by DOMContentUtils.getOutlinks: it reads links from
the attributes of the parsed DOM rather than matching raw strings.
>
> Right now, I'm just doing site-search which means this isn't that big
> a problem. But I'm concerned about implementing a wider ranging
> search index without having a resolution to this problem - I'd hate
> for my spider to be grabbing a bunch of unlinked 404's.
>
> Also - does nutch follow rel="nofollow" links out of the box?
>
> I imagine that it respects robots.txt, but I thought I'd ask about
> that one too, just to be safe - I'm a newbie after all :)
DOMContentUtils.getOutlinks ignores rel="nofollow" links, so they are not
followed.
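To make the idea concrete, here is a minimal sketch in Python (not Nutch's actual Java code) of attribute-based link extraction. It only looks at href attributes of parsed <a> tags, so text inside a script block is never treated as a link, and anchors marked rel="nofollow" are skipped. The names AnchorLinkCollector, extract_links and SAMPLE are made up for this example.

```python
from html.parser import HTMLParser


class AnchorLinkCollector(HTMLParser):
    """Collect href values from <a> tags, skipping rel="nofollow" anchors."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only real anchor elements count; script text never reaches here.
        if tag != "a":
            return
        attr_map = dict(attrs)  # attribute names arrive lowercased
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            return
        href = attr_map.get("href")
        if href:
            self.links.append(href)


def extract_links(html):
    """Return the followable links found in an HTML string."""
    collector = AnchorLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page: a MediaWiki-style script variable plus two anchors.
SAMPLE = """
<html><head>
<script type="text/javascript">var wgArticlePath = "/wiki/$1";</script>
</head><body>
<a href = "/wiki/Main_Page">Main page</a>
<a href="/wiki/Special:Login" rel="nofollow">Log in</a>
</body></html>
"""

print(extract_links(SAMPLE))  # ['/wiki/Main_Page']
```

Because the parser works on attributes, the spacing around the equals sign does not matter, and "/wiki/$1" is ignored since it only appears inside script text.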
HTH,
Renaud