hi,

the commands "readdb" and "readlinkdb" could be interesting for you:
http://wiki.apache.org/nutch/08CommandLineOptions

If you want to see the in/outlinks (readlinkdb) of a given page, you must
first invoke the "invertlinks" command.
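For example, something like the following (assuming a crawl directory named "crawl/" as in the Nutch tutorial, and a hypothetical page URL; adjust the paths to your setup):

```shell
# Show statistics for the crawldb, or the entry for a single URL:
bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -url http://www.example.com/

# Build the linkdb from the fetched segments (inverts outlinks to inlinks):
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Then inspect the inlinks of a given page:
bin/nutch readlinkdb crawl/linkdb -url http://www.example.com/
```

The "readdb -url" output in particular shows the status and metadata Nutch has recorded for a URL, which may help you see where those strange entries came from.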

Unfortunately, I don't know how to remove an individual URL from the
crawldb . . . sorry


Hope it helps,

Martin


On 9/11/07, Jeff Van Boxtel <[EMAIL PROTECTED]> wrote:
>
> I am experiencing a problem where my fetcher is trying to grab lots of
> URLs that don't exist. For example it will try to get:
>
> fetching http://www.ourhost.com/project_files/PROJECTS/000260/WP/0L19MM14.doc/0k07mm10.doc/%200L19MM14.doc/0i13mm4.doc/0I29MM3.PDF/%200L19MM14.doc/
>
> There is no such url that exists and I can't figure out where the
> crawler is getting these strange urls from. I don't think any of my
> pages link to something like this. I have also seen other (less bizarre)
> urls that don't seem to exist and there are no links to them anywhere on
> our site. Is it possible that the crawldb is getting corrupt? Is there a
> way I can see where the crawldb got these URLs from? And if the urls
> result in a 404 page is there a way to have them removed from the
> crawldb?
>
