Comment out this line in crawl-urlfilter.txt: -[...@=]
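To illustrate why that line matters, here is a rough sketch of first-match-wins filtering. It is not Nutch's actual filter code: the class UrlFilterSketch and the method accepts are made up for illustration, and the rule "-[@=]" uses only the two characters visible in the line above (the rest of the line is elided, so that part is an assumption). The point is only that a reject rule whose character class contains "=" will drop every URL with a query parameter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {

    // Hypothetical, simplified filter: each rule is '+' (accept) or '-' (reject)
    // followed by a regex. The first rule whose regex is found in the URL decides;
    // if no rule matches, the URL is rejected.
    static boolean accepts(String url, List<String> rules) {
        for (String rule : rules) {
            boolean include = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return include;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page=2";

        // Assumed rule set: only the characters visible in the elided line, then accept-all.
        List<String> withRule = Arrays.asList("-[@=]", "+.");
        // Same rule set with the reject line commented out (i.e. removed).
        List<String> ruleCommentedOut = Arrays.asList("+.");

        System.out.println(accepts(url, withRule));          // false: the URL contains '='
        System.out.println(accepts(url, ruleCommentedOut));  // true: '+.' matches any URL
    }
}
```

With the reject rule active, the page=2 link never reaches the fetch list, which would match the symptom described below.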
Alex.
-----Original Message-----
From: MyD <myd.ro...@googlemail.com>
To: nutch-user@lucene.apache.org
Sent: Thu, 19 Mar 2009 6:14 am
Subject: Re: Nutch doesn't find all urls.. Any suggestion?
I should add that in the HTML source code the link is a relative URL, like
(/cfp/call?conference=artificial%20intelligence&page=2)
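For reference, a relative link like this resolves against the page URL to the full address from my first mail. A small sketch using the standard java.net.URI class (the class name RelativeLinkSketch and the helper resolve are only for illustration, and this is not necessarily how Nutch's HTML parser resolves outlinks):

```java
import java.net.URI;

public class RelativeLinkSketch {

    // Resolve a link found in a page against the URL of that page.
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String pageUrl = "http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&skip=1";
        String link = "/cfp/call?conference=artificial%20intelligence&page=2";

        // Prints http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page=2
        System.out.println(resolve(pageUrl, link));
    }
}
```

So the relative form by itself should not be the problem, as long as the resolved URL is not filtered out afterwards.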
Regards,
MyD
MyD wrote:
>
> Hi @ all,
>
> I'd like to run an intranet crawl with my own plugin on the domain
> www.wikicfp.com.
> (http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&skip=1)
>
> The problem is that Nutch doesn't find the important URLs, so it can't
> crawl any further:
> (http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page=2)
> (http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page=3)
> (http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page=4)
> (http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page=
> ....)
>
> Any suggestions?
>
> nutch-site.xml
>
> <property>
> <name>plugin.includes</name>
> <value>my-plugin|protocol-http|parse-(html|js)|index-basic</value>
> <description>
> </description>
> </property>
>
> I commented out all the rules in the urlfilter files (regex etc.) in conf/.
>
> Thanks in advance.
>
> Regards,
> MyD
>
>