Hi,

I checked the Hadoop log file and found many URLs like the following:

http://target-domain-name/whitepapers/text/:/\x7F/1.6
http://target-domain-name/whitepapers/text/:/\x7F/8.0.2
http://target-domain-name/whitepapers/text/:/\x7F/WT.ti]
http://target-domain-name/whitepapers/text/firefox/1.1
http://target-domain-name/whitepapers/text/firefox/:/\x7F/g,
http://target-domain-name/whitepapers/text/text/DCS.dcsref
http://target-domain-name/:/\x7F/firefox/firefox/0.

Here target-domain-name stands for the website I am crawling. I think these URLs are one reason for my high ratio of unfetched URLs, but I do not understand why they are included in the first place. Can anyone suggest how to improve the URL quality?

Thanks
Ian

ianwong wrote:
>
> Greetings,
>
> I want to know how to make the crawler fetch as many pages as it can. I
> want to reduce the number of unfetched pages.
>
> The first time, I used the default value of 100 for
> db.max.outlinks.per.page. I noticed there are many more unfetched URLs
> than fetched URLs: about 15000 against 3000.
>
> Then I tried setting db.max.outlinks.per.page to -1, but the ratio is
> similar: 15000 vs 3000.
>
> I also tried setting it to 0, and nothing was fetched.
>
> Thanks
>
> Ian
>

--
View this message in context: http://www.nabble.com/about-unfetched-links-tp20851584p20875920.html
Sent from the Nutch - User mailing list archive at Nabble.com.
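
To get a rough idea of how many logged URLs are of this kind, a small script along these lines might help. It is only a sketch and not part of Nutch: the function name `looks_malformed` and its three heuristics (an escaped or raw control character such as \x7F, a path segment that is just ":", and the same path segment twice in a row) are assumptions drawn from the sample URLs above.

```python
import re
from urllib.parse import urlsplit

# Matches a literal "\x7F"-style escape as it appears in the log text,
# or a raw ASCII control character (0x00-0x1F and DEL, 0x7F).
CONTROL = re.compile(r"\\x[0-9A-Fa-f]{2}|[\x00-\x1f\x7f]")


def looks_malformed(url):
    """Heuristic only: True if the URL path resembles the samples above."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if CONTROL.search(path):
        return True
    if ":" in segments:  # a path segment consisting of just ":"
        return True
    # The same segment twice in a row, e.g. /text/text/ or /firefox/firefox/
    return any(a == b for a, b in zip(segments, segments[1:]))


if __name__ == "__main__":
    samples = [
        r"http://target-domain-name/whitepapers/text/:/\x7F/1.6",
        "http://target-domain-name/whitepapers/text/text/DCS.dcsref",
        "http://target-domain-name/whitepapers/text/firefox/1.1",
    ]
    for url in samples:
        print(looks_malformed(url), url)
```

These checks are deliberately narrow. A URL such as /whitepapers/text/firefox/1.1 passes all three even though it is probably also a bogus link, so the count is a lower bound.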
