This is yet another side effect of applying TextParser to documents that
are not plain text; in this particular case it falls short on namespace
declarations. I propose that we remove TextParser from at least the
following MIME types:
* (default)
application/rss+xml
application/vnd.wap.wbxml
application/vnd.wap.wmlc
application/vnd.wap.wmlscriptc
application/xhtml+xml
application/x-latex
application/x-netcdf
application/x-tex
application/x-texinfo
application/x-troff
application/x-troff-man
application/x-troff-me
application/x-troff-ms
message/news
message/rfc822
text/css
text/sgml
text/vnd.wap.wml
text/xml
text/x-setext
I would guess that the application/xhtml+xml MIME type should be handled
by the HTML parser anyway.
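
To make the proposal concrete, here is a minimal sketch of the intended
effect on a MIME-type-to-parser table. The class ParserMapping and the
method withoutTextParser are hypothetical and are not part of the Nutch
API; the ids "parse-text" and "parse-html" are used only as labels for the
text and HTML parser plugins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a MIME-type-to-parser table; not Nutch code. */
class ParserMapping {
    // Illustrative label for the plain-text parser plugin.
    static final String TEXT = "parse-text";

    /**
     * Returns a copy of the mapping in which the text parser has been
     * removed from every MIME type named in 'types'. Other entries are
     * copied unchanged.
     */
    static Map<String, List<String>> withoutTextParser(
            Map<String, List<String>> mapping, List<String> types) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : mapping.entrySet()) {
            List<String> parsers = new ArrayList<>(e.getValue());
            if (types.contains(e.getKey())) {
                parsers.remove(TEXT);
            }
            result.put(e.getKey(), parsers);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new LinkedHashMap<>();
        mapping.put("application/xhtml+xml", List.of("parse-html", TEXT));
        mapping.put("text/plain", List.of(TEXT));

        // Only the listed type loses the text parser; text/plain keeps it.
        System.out.println(
            withoutTextParser(mapping, List.of("application/xhtml+xml")));
    }
}
```

With such a table, application/xhtml+xml would be left with the HTML
parser only, so the plain-text link extraction would never see the markup.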
--
Sami Siren
2006/8/25, Michael Wechner <[EMAIL PROTECTED]>:
I think the problem is as follows with XHTML files:
2006-08-25 16:06:11,925 WARN parse.ParserFactory -
ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
contentType application/xhtml+xml via parse-plugins.xml, but its
plugin.xml file does not claim to support contentType:
application/xhtml+xml
2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: xmlns
        at java.net.URL.<init>(URL.java:544)
        at java.net.URL.<init>(URL.java:434)
        at java.net.URL.<init>(URL.java:383)
whereas maybe this could be resolved with
http://issues.apache.org/jira/browse/NUTCH-359
I am kind of surprised that nobody else is having this problem with
proper XHTML ;-)
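
The exception in the log can be reproduced outside Nutch. This is a small
sketch, assuming that the plain-text link extraction picks up a token such
as "xmlns:xhtml" from a namespace declaration and hands it to java.net.URL
(the exact token is not shown in the log, so the input string is an
assumption). The class XmlnsUrlDemo and the method urlError are made up for
this illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Illustration only: shows how java.net.URL treats an "xmlns:" token. */
class XmlnsUrlDemo {
    /**
     * Returns the MalformedURLException message if the candidate is
     * rejected by java.net.URL, or null if it parses as a URL.
     */
    static String urlError(String candidate) {
        try {
            new URL(candidate);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // "xmlns" looks like a scheme, but no protocol handler exists for it.
        System.out.println(urlError("xmlns:xhtml"));
        // A real http URL parses without an exception.
        System.out.println(urlError("http://www.w3.org/1999/xhtml"));
    }
}
```

The first call reports "unknown protocol: xmlns", matching the ERROR line
above; the second returns null because http is a known protocol.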
Thanks
Michi
Ken Gregoire wrote:
> look here, it is blocking robots: http://ulysses.wyona.org/robots.txt
>
> User-agent: *
> Disallow: /foo/bar.html
>
> User-agent: lenya
> Disallow: /foo/bar.html
>
>
>
>
>
> Michael Wechner wrote:
>
>> Hi
>>
>> I am trying to index http://ulysses.wyona.org/ but somehow it just
>> indexes the homepage but doesn't seem to follow
>> any links. I have set "depth 3" and other sites are being crawled
>> deeper without a problem but not the Ulysses page.
>>
>> Has anyone had similar experiences?
>>
>> Is it possible that Nutch has a problem with well-formed XHTML
>> (application/xhtml+xml)?
>>
>> Thanks
>>
>> Michi
>>
>
--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED] [EMAIL PROTECTED]
+41 44 272 91 61
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general