Re: Nutch doesn't dive deeper

Michael Wechner Fri, 25 Aug 2006 07:13:07 -0700

I think the problem is as follows with XHTML files:

2006-08-25 16:06:11,925 WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml

2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: xmlns
       at java.net.URL.<init>(URL.java:544)
       at java.net.URL.<init>(URL.java:434)
       at java.net.URL.<init>(URL.java:383)
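
For illustration, the exception in the trace above can be reproduced outside Nutch. This is only a minimal sketch: the class name OutlinkCheck, the helper isParsableUrl and the candidate string "xmlns:xhtml" are assumptions made for the example, not what OutlinkExtractor actually passes in. It just shows that java.net.URL rejects any string whose scheme has no registered protocol handler.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class OutlinkCheck {
    // Returns true if java.net.URL accepts the candidate string,
    // false if the constructor throws MalformedURLException.
    static boolean isParsableUrl(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            // For an unregistered scheme such as "xmlns" this is the
            // "unknown protocol" case seen in the stack trace.
            return false;
        }
    }

    public static void main(String[] args) {
        // "xmlns:xhtml" is a made-up candidate, used only to trigger the error.
        String[] candidates = {"http://ulysses.wyona.org/", "xmlns:xhtml"};
        for (String c : candidates) {
            System.out.println(c + " -> " + isParsableUrl(c));
        }
    }
}
```

If text taken from an XHTML document is scanned for URL-like strings, namespace declarations could plausibly end up as candidates like this, which would fit the error above.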



whereas maybe this could be resolved with

http://issues.apache.org/jira/browse/NUTCH-359

I am kind of surprised that nobody else is having this problem with proper XHTML ;-)


Thanks

Michi

Ken Gregoire wrote:

look here, it is blocking robots: http://ulysses.wyona.org/robots.txt

User-agent: *
Disallow: /foo/bar.html

User-agent: lenya
Disallow: /foo/bar.html
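
To see which paths the rules quoted above cover, here is a rough sketch of Disallow prefix matching. It is a simplification and not Nutch's robots handling: it merges the "*" group with any group naming the agent, and ignores Allow lines, wildcards and consecutive User-agent lines. The class RobotsSketch, its methods and the agent name "nutch" are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsSketch {
    // Collects Disallow prefixes from groups whose User-agent is "*"
    // or equals the given agent name (case-insensitive).
    static List<String> disallowedFor(String robotsTxt, String agent) {
        List<String> rules = new ArrayList<>();
        boolean applies = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue; // blank or malformed line
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                applies = value.equals("*") || value.equalsIgnoreCase(agent);
            } else if (field.equals("disallow") && applies && !value.isEmpty()) {
                rules.add(value);
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with one of the Disallow prefixes.
    static boolean isAllowed(String robotsTxt, String agent, String path) {
        for (String prefix : disallowedFor(robotsTxt, agent)) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /foo/bar.html\n\n"
                      + "User-agent: lenya\nDisallow: /foo/bar.html\n";
        System.out.println(isAllowed(robots, "nutch", "/"));
        System.out.println(isAllowed(robots, "nutch", "/foo/bar.html"));
    }
}
```

Under these assumptions only /foo/bar.html is excluded, and the homepage and other paths remain fetchable.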





Michael Wechner wrote:
Hi
I am trying to index http://ulysses.wyona.org/ but somehow it just indexes the homepage and doesn't seem to follow any links. I have set "depth 3", and other sites are being crawled deeper without a problem, but not the Ulysses page.
Has anyone had a similar experience?
Is it possible that Nutch has a problem with well-formed XHTML (application/xhtml+xml)?
Thanks

Michi



--
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61
