I found a bug in Fetcher which causes the problem you reported. Currently, Nutch only applies the external-link check during the parsing process, which ensures that all outlinks generated are on the same host as the from-URL. But for links that get redirected during fetch, this is not enough: we also need to make sure the redirected URL is on the same host as the source URL. Take the link below as an example:

http://www.nxtravel.net/?feed=AS&template=Lander_Hybrid&rank=4&keyword=Loans&d=unsecured-direct-loan.com&rid=http%3A%2F%2Fwww.google.com%2Furl%3Fsa%3DL%26ai%3DBLo7nXConRq6MG5_IhQS6xtEClJquHNzjjKMGrOuW0wTAuAIQBBgEIInKzAcoBzABOAFQ0PfZ2vj_____AWCdudCBkAWYAeeHAZgBhogBqgEFMDI1MTSyAQxueHRyYXZlbC5uZXTIAQHaAQxueHRyYXZlbC5uZXTIApS06QHZAzr5xMjNnhl44AMC%26num%3D4%26q%3Dhttp%3A%2F%2Funsecured-direct-loan.com%2Funsecured-loans-online.html%26usg%3DAFrqEzct1VSZnZ48RrXOwHNyxS8qzm9O_w

It will be redirected to

http://unsecured-direct-loan.com/unsecured-loans-online.html
and this new redirected URL will then be added to the fetch queue. So adding the external-link check for MOVED and TEMP_MOVED URLs should fix this problem.

----- Original Message -----
From: "Tomi N/A" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, April 18, 2007 6:40 PM
Subject: Re: Fetching outside the domain ?

> Continuing the noble tradition of replying to my own messages, I have
> a small update on the topic of the crawler crawling outside of the
> given list of hosts in spite of db.ignore.external.links=true...
>
> 2006/10/25, Tomi NA <[EMAIL PROTECTED]>:
>
>> > Could you give an example of a root URL, which leads to this symptom
>> > (i.e. leaks outside the original site)?
>>
>> I'll try to find out exactly where the crawler starts to run loose as
>> I have several web sites in my initial URL list.
>
> I'm using nutch 0.9 now and have run into the problem again. It's a
> bit hard to reproduce as I have dozens of hosts in my initial URL list
> and the crawler leaves them days after I start the crawl: it's very
> difficult to pinpoint how or why the crawler steps outside its
> bounds.
>
> Did anyone else run into such a problem?
> Is there anything else I need to set up besides
> db.ignore.external.links=true?
>
> TIA,
> t.n.a.
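For illustration, here is a minimal sketch of the kind of same-host check described above, applied to a redirect target before it is queued. The class and method names are hypothetical (this is not the actual Nutch patch); it only shows the host comparison that the MOVED/TEMP_MOVED handling in Fetcher would need when db.ignore.external.links is true.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RedirectFilter {
    /**
     * Hypothetical helper: returns the redirect target only if it is on
     * the same host as the source URL; otherwise returns null so the
     * caller can drop the redirect instead of queuing it.
     */
    public static String filterRedirect(String fromUrl, String redirectUrl,
                                        boolean ignoreExternalLinks) {
        if (!ignoreExternalLinks) {
            return redirectUrl; // no external-link check requested
        }
        try {
            String fromHost = new URL(fromUrl).getHost().toLowerCase();
            String toHost = new URL(redirectUrl).getHost().toLowerCase();
            return fromHost.equals(toHost) ? redirectUrl : null;
        } catch (MalformedURLException e) {
            return null; // unparsable URLs are dropped as well
        }
    }

    public static void main(String[] args) {
        // The example from this thread: the redirect leaves the host,
        // so the filter rejects it and prints "null".
        String from = "http://www.nxtravel.net/?feed=AS&keyword=Loans";
        String to = "http://unsecured-direct-loan.com/unsecured-loans-online.html";
        System.out.println(filterRedirect(from, to, true)); // prints "null"
    }
}
```

With a check like this applied to both MOVED and TEMP_MOVED statuses, the redirected URL from the example above would never enter the fetch queue.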
