Hi, 

I checked the Hadoop log file and found many URLs similar to the
following:

http://target-domain-name/whitepapers/text/:/\x7F/1.6

http://target-domain-name/whitepapers/text/:/\x7F/8.0.2

http://target-domain-name/whitepapers/text/:/\x7F/WT.ti]

http://target-domain-name/whitepapers/text/firefox/1.1

http://target-domain-name/whitepapers/text/firefox/:/\x7F/g,

http://target-domain-name/whitepapers/text/text/DCS.dcsref

http://target-domain-name/:/\x7F/firefox/firefox/0.

target-domain-name here refers to the website I am crawling. I think these
URLs are one reason why I got a high ratio of unfetched URLs. I do not
understand why they are included. Can anyone suggest how to improve URL
quality?
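
To make the pattern concrete, here is a minimal Python sketch of how URLs like
the ones above could be flagged. It is only an illustration and not part of
Nutch: the function name looks_malformed and the three heuristics (an escaped
or raw control character, a bare ":" path segment, the same path segment
repeated back to back) are assumptions drawn from the sample URLs, and the
repeated-segment rule could also flag some legitimate URLs.

```python
import re
from urllib.parse import urlsplit

# Matches an escape such as "\x7F" as it is printed in the log,
# or a raw control character (assumed heuristic, not a Nutch rule).
CONTROL = re.compile(r"\\x[0-9A-Fa-f]{2}|[\x00-\x1f\x7f]")


def looks_malformed(url: str) -> bool:
    """Return True if the URL path matches one of the assumed heuristics."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if CONTROL.search(path):
        return True
    if ":" in segments:  # a path segment that is only a colon
        return True
    # the same segment twice in a row, e.g. ".../text/text/..."
    return any(a == b for a, b in zip(segments, segments[1:]))


if __name__ == "__main__":
    samples = [
        "http://target-domain-name/whitepapers/text/:/\\x7F/1.6",
        "http://target-domain-name/whitepapers/text/firefox/1.1",
        "http://target-domain-name/whitepapers/text/text/DCS.dcsref",
    ]
    for url in samples:
        print(looks_malformed(url), url)
```

The same kind of pattern could presumably be expressed as an exclusion rule in
the crawler's URL filter instead of being checked after the fact.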

Thanks

Ian




ianwong wrote:
> 
> Greetings,
> 
> I want to know how to make the crawler fetch as many pages as it can. I
> want to reduce the number of unfetched pages.
> 
> At first, I used the default value of 100 for db.max.outlinks.per.page.
> I noticed there were many more unfetched URLs than fetched URLs: about
> 15000 against 3000.
> 
> Then I tried setting db.max.outlinks.per.page to -1, but the ratio was
> similar, about 15000 vs 3000.
> 
> I also tried setting it to 0, and nothing was fetched.
> 
> Thanks
> 
> Ian
> 

-- 
View this message in context: 
http://www.nabble.com/about-unfetched-links-tp20851584p20875920.html
Sent from the Nutch - User mailing list archive at Nabble.com.
