Comment out this line in crawl-urlfilter.txt: -[...@=]
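
To illustrate why that rule hides the paginated links, here is a rough Python sketch. It assumes the filter file works as an ordered list of +/- regex rules where the first matching rule decides. It uses only the two characters still visible in the rule above (`@` and `=`); the elided part is left out. The names `RULES` and `accepts` are made up for this example and are not Nutch code.

```python
import re

# Hypothetical rule list modelled on a urlfilter file: (sign, pattern).
# Only the characters visible in the rule quoted above are used here.
RULES = [
    ("-", re.compile(r"[@=]")),  # skip URLs containing these characters
    ("+", re.compile(r".")),     # accept anything else
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern occurs in url is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: a URL matching no rule is dropped

url = "http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page=2"
print(accepts(url))             # the '-' rule matches on '=' first
print(accepts(url, RULES[1:]))  # same URL with the '-' rule commented out
```

Every link in question carries a query string with `=`, so under these assumptions the skip rule fires before the accept rule is ever reached.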

Alex.

-----Original Message-----
From: MyD <myd.ro...@googlemail.com>
To: nutch-user@lucene.apache.org
Sent: Thu, 19 Mar 2009 6:14 am
Subject: Re: Nutch doesn't find all urls.. Any suggestion?

I should add that in the HTML source code the link is a relative URL, like
(/cfp/call?conference=artificial%20intelligence&page=2)
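
As a quick check of what that relative link should resolve to, here is a small sketch using Python's standard `urllib.parse.urljoin`. It assumes the page being parsed is the start URL from the original message. It only shows the expected absolute URL and says nothing about how Nutch resolves links internally.

```python
from urllib.parse import urljoin

# Page on which the link appears (the start URL from the original message).
base = "http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&skip=1"

# Relative link as it appears in the HTML source.
relative = "/cfp/call?conference=artificial%20intelligence&page=2"

# A link starting with '/' keeps the scheme and host of the base page and
# replaces its path and query string.
resolved = urljoin(base, relative)
print(resolved)
```

The result is the same absolute `page=2` URL listed in the original message, so the relative form by itself should not stop the link from being found.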

Regards,
MyD


MyD wrote:
> 
> Hi @ all,
> 
> I'd like to run an intranet crawl with my own plugin on the domain
> www.wikicfp.com.
> (http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&skip=1)
> 
> The problem is that Nutch doesn't find the important URLs, so it can't
> crawl any further:
> (http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page=2)
> (http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page=3)
> (http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page=4)
> (http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page=
> ....)
> 
> Any suggestions?
> 
> nutch-site.xml
> 
> <property>
>   <name>plugin.includes</name>
>   <value>my-plugin|protocol-http|parse-(html|js)|index-basic</value>
>   <description>
>   </description>
> </property>
> 
> I commented out all urlfilter files (regex etc.) in conf/.
> 
> Thanks in advance.
> 
> Regards,
> MyD
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Nutch-doesn%27t-find-all-urls..-Any-suggestion--tp22599690p22599904.html
Sent from the Nutch - User mailing list archive at Nabble.com.