Thank you, Andrzej, it helped!

Another problem I am facing right now is limiting the total number of URLs
crawled on a single website.
The "generate.max.per.host" value doesn't seem to work as I expected:
it is set to 300,
yet the total number of crawled URLs varies from 600 to ~3000 depending
on the "crawl.link.depth" value.

Is there any way to limit the total number of URLs crawled per site (under
Nutch 0.8)?
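
The numbers above would be consistent with the limit being applied once per generate/fetch round instead of once per crawl. That reading is an assumption, not something confirmed in this thread, and the helper name below is hypothetical. A minimal sketch of the arithmetic under that assumption:

```python
# Hypothetical model, not Nutch code: assume the per-host cap is enforced
# once per generate/fetch round rather than once for the whole crawl.
def max_urls_per_host(max_per_host: int, depth: int) -> int:
    """Upper bound on URLs fetched from one host over `depth` rounds."""
    return max_per_host * depth


# With a cap of 300, the bound grows linearly with the crawl depth.
for depth in (2, 5, 10):
    print(depth, max_urls_per_host(300, depth))
```

Under this assumption a cap of 300 allows up to 600 URLs at depth 2 and up to 3000 at depth 10, which matches the range reported above.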



Andrzej Bialecki wrote:
> 
> [EMAIL PROTECTED] wrote:
>> I am using Nutch 0.8 to crawl a list of websites and I have found that
>> Nutch cannot find all the links on a page.
>>
>> For example: http://www.artbrown.com/
>>
>> According to Google, this website has approximately 4,450 pages.
>> However, no matter how I change Nutch's config, it won't crawl more
>> than 9 pages.
>>
>> I've tried changing "crawl.link.depth" and "http.content.limit",
>> and using the TagSoup HTML parser instead of NekoHTML as described here:
>> http://www.mail-archive.com/[email protected]/msg03141.html
>> but it doesn't help.
>>
>> Any ideas?
>>   
> 
> Check your url filters - most likely you have the default rule that 
> discards URLs with special characters.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 
> 
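
To illustrate the URL filter hint quoted above: a regex URL filter of this kind can be pictured as an ordered list of rules where "-" rejects, "+" accepts, and the first matching rule decides. The sketch below is not Nutch's implementation. The character class, the rule order, and the example URLs are all assumptions chosen for illustration.

```python
import re

# Assumed rule set modelled on a regex URL filter: "-" rejects, "+" accepts,
# and the first rule whose pattern is found in the URL decides the outcome.
RULES = [
    ("-", re.compile(r"[?*!@=]")),  # drop URLs that look like queries
    ("+", re.compile(r".")),        # accept everything else
]


def accepts(url: str) -> bool:
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject


# Hypothetical URLs, not taken from the site discussed in this thread.
print(accepts("http://www.example.com/catalog/pens.html"))
print(accepts("http://www.example.com/item.php?id=42"))
```

With such a rule in place, a site whose links mostly carry query strings would yield only a handful of crawlable pages; relaxing or removing the reject rule would let those URLs through.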

-- 
View this message in context: 
http://www.nabble.com/Nutch-0.8-cannot-find-all-the-links-on-a-page-tf3033338.html#a8446081
Sent from the Nutch - User mailing list archive at Nabble.com.


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general