Hi, the "readdb" and "readlinkdb" commands could be useful for you: http://wiki.apache.org/nutch/08CommandLineOptions
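
For example, something like this (the "crawl/" paths and the sample URL are just placeholders, point them at your own crawl directory):

  # build the linkdb from your fetched segments (needed before readlinkdb)
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments

  # show the inlinks recorded for a given page
  bin/nutch readlinkdb crawl/linkdb -url http://www.ourhost.com/somepage.html

  # print the crawldb entry (status, fetch time, score) for one URL
  bin/nutch readdb crawl/crawldb -url http://www.ourhost.com/somepage.html

  # or dump the whole crawldb as text and grep it for the strange URLs
  bin/nutch readdb crawl/crawldb -dump crawldb-dump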
If you want to see the in/outlinks of a given page (readlinkdb), you must first invoke the "invertlinks" command. Unfortunately, I don't know how to remove an individual URL from a crawldb . . . sorry.

Hope it helps,
Martin

On 9/11/07, Jeff Van Boxtel <[EMAIL PROTECTED]> wrote:
>
> I am experiencing a problem where my fetcher is trying to grab lots of
> URLs that don't exist. For example, it will try to get:
>
> fetching http://www.ourhost.com/project_files/PROJECTS/000260/WP/0L19MM14.doc/0k07mm10.doc/%200L19MM14.doc/0i13mm4.doc/0I29MM3.PDF/%200L19MM14.doc/
>
> No such URL exists, and I can't figure out where the crawler is getting
> these strange URLs from. I don't think any of my pages link to anything
> like this. I have also seen other (less bizarre) URLs that don't seem to
> exist and that nothing on our site links to. Is it possible that the
> crawldb is getting corrupt? Is there a way I can see where the crawldb
> got these URLs from? And if the URLs result in a 404 page, is there a
> way to have them removed from the crawldb?
