I found a bug in Fetcher which causes the problem you reported. Currently, Nutch only applies the external-link check during the parsing process, which ensures that all outlinks generated are on the same host as the from-URL. But for links that get redirected during fetch, this is not enough: we also need to make sure the redirected URL is on the same host as the source URL. Take the link below as an example:

http://www.nxtravel.net/?feed=AS&template=Lander_Hybrid&rank=4&keyword=Loans&d=unsecured-direct-loan.com&rid=http%3A%2F%2Fwww.google.com%2Furl%3Fsa%3DL%26ai%3DBLo7nXConRq6MG5_IhQS6xtEClJquHNzjjKMGrOuW0wTAuAIQBBgEIInKzAcoBzABOAFQ0PfZ2vj_____AWCdudCBkAWYAeeHAZgBhogBqgEFMDI1MTSyAQxueHRyYXZlbC5uZXTIAQHaAQxueHRyYXZlbC5uZXTIApS06QHZAzr5xMjNnhl44AMC%26num%3D4%26q%3Dhttp%3A%2F%2Funsecured-direct-loan.com%2Funsecured-loans-online.html%26usg%3DAFrqEzct1VSZnZ48RrXOwHNyxS8qzm9O_w

It will be redirected to

http://unsecured-direct-loan.com/unsecured-loans-online.html
and this new redirected URL will then be added to the fetch queue. So adding the external-link check for MOVED and TEMP_MOVED URLs should fix this problem.

----- Original Message -----
From: "Tomi N/A" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, April 18, 2007 6:40 PM
Subject: Re: Fetching outside the domain ?

> Continuing the noble tradition of replying to my own messages, I have
> a small update on the topic of the crawler crawling outside of the
> given list of hosts in spite of db.ignore.external.links=true...
>
> 2006/10/25, Tomi NA <[EMAIL PROTECTED]>:
>
>> > Could you give an example of a root URL, which leads to this symptom
>> > (i.e. leaks outside the original site)?
>>
>> I'll try to find out exactly where the crawler starts to run loose as
>> I have several web sites in my initial URL list.
>
> I'm using nutch 0.9 now and have run into the problem again. It's a
> bit hard to reproduce as I have dozens of hosts in my initial URL list
> and the crawler leaves them days after I start the crawl: it's very
> difficult to pinpoint how or why the crawler steps outside its
> bounds.
>
> Did anyone else run into such a problem?
> Is there anything else I need to set up besides
> db.ignore.external.links=true?
>
> TIA,
> t.n.a.
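For illustration, here is a minimal sketch of the kind of same-host check described above, applied to a redirect target before it is queued. The class and method names are hypothetical (this is not the actual Nutch patch); it only shows the host comparison that the MOVED/TEMP_MOVED handling in Fetcher would need when db.ignore.external.links is true.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RedirectFilter {
    /**
     * Hypothetical helper: returns the redirect target only if it is on
     * the same host as the source URL; otherwise returns null so the
     * caller can drop the redirect instead of queuing it.
     */
    public static String filterRedirect(String fromUrl, String redirectUrl,
                                        boolean ignoreExternalLinks) {
        if (!ignoreExternalLinks) {
            return redirectUrl; // no external-link check requested
        }
        try {
            String fromHost = new URL(fromUrl).getHost().toLowerCase();
            String toHost = new URL(redirectUrl).getHost().toLowerCase();
            return fromHost.equals(toHost) ? redirectUrl : null;
        } catch (MalformedURLException e) {
            return null; // unparsable URLs are dropped as well
        }
    }

    public static void main(String[] args) {
        // The example from this thread: the redirect leaves the host,
        // so the filter rejects it and prints "null".
        String from = "http://www.nxtravel.net/?feed=AS&keyword=Loans";
        String to = "http://unsecured-direct-loan.com/unsecured-loans-online.html";
        System.out.println(filterRedirect(from, to, true)); // prints "null"
    }
}
```

With a check like this applied to both MOVED and TEMP_MOVED statuses, the redirected URL from the example above would never enter the fetch queue.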
