sami siren wrote:

> This is yet another side effect of applying TextParser to non plain text
> documents and in this particular case it comes short with namespace
> declarations. I propose that we remove the PlainText parser from at least
> the following mime types:
>
> * (default)
> application/rss+xml
> application/vnd.wap.wbxml
> application/vnd.wap.wmlc
> application/vnd.wap.wmlscriptc
> application/xhtml+xml
> application/x-latex
> application/x-netcdf
> application/x-tex
> application/x-texinfo
> application/x-troff
> application/x-troff-man
> application/x-troff-me
> application/x-troff-ms
> message/news
> message/rfc822
> text/css
> text/sgml
> text/vnd.wap.wml
> text/xml
> text/x-setext
>
> I would guess that handling of text/xhtml+xml

I guess you mean application/xhtml+xml (as you actually note above)

> mimetype should be done with
> html parser anyway.

yes, I would say so

Thanks

Michi

>
> --
> Sami Siren
>
> 2006/8/25, Michael Wechner <[EMAIL PROTECTED]>:
>
>>
>> I think the problem is as follows with XHTML files:
>>
>> 2006-08-25 16:06:11,925 WARN parse.ParserFactory -
>> ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
>> contentType application/xhtml+xml via parse-plugins.xml, but its
>> plugin.xml file does not claim to support contentType:
>> application/xhtml+xml
>> 2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks
>> java.net.MalformedURLException: unknown protocol: xmlns
>> at java.net.URL.<init>(URL.java:544)
>> at java.net.URL.<init>(URL.java:434)
>> at java.net.URL.<init>(URL.java:383)
>>
>>
>> whereas maybe this could be resolved with
>>
>> http://issues.apache.org/jira/browse/NUTCH-359
>>
>> I am kind of surprised that nobody else is having this problem with
>> proper XHTML ;-)
>>
>> Thanks
>>
>> Michi
>>
>> Ken Gregoire wrote:
>>
>> > look here, it is blocking robots: http://ulysses.wyona.org/robots.txt
>> >
>> > User-agent: *
>> > Disallow: /foo/bar.html
>> >
>> > User-agent: lenya
>> > Disallow: /foo/bar.html
>> >
>> >
>> > Michael Wechner wrote:
>> >
>> >> Hi
>> >>
>> >> I am trying to index http://ulysses.wyona.org/ but somehow it just
>> >> indexes the homepage but doesn't seem to follow
>> >> any links. I have set "depth 3" and other sites are being crawled
>> >> deeper without a problem but not the Ulysses page.
>> >>
>> >> Has anyone had similar experiences?
>> >>
>> >> Is it possible that Nutch has a problem with well-formed XHTML
>> >> (application/xhtml+xml)?
>> >>
>> >> Thanks
>> >>
>> >> Michi
>> >>
>> >
>>
>>
>> --
>> Michael Wechner
>> Wyona - Open Source Content Management - Apache Lenya
>> http://www.wyona.com http://lenya.apache.org
>> [EMAIL PROTECTED] [EMAIL PROTECTED]
>> +41 44 272 91 61
>>
>>
>

--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED] [EMAIL PROTECTED]
+41 44 272 91 61

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
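
For anyone who wants to see where the "unknown protocol: xmlns" message in the log above comes from, it can be reproduced with plain java.net.URL, outside Nutch. This is only a rough sketch: the class name XmlnsUrlDemo and the helper tryParse are made up for illustration, and the namespace string is an assumed example of text a plain-text outlink scan might pick up, not something copied from the Ulysses pages. It does not show how OutlinkExtractor itself works.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
class XmlnsUrlDemo {

    // Returns null if the candidate parses as a URL,
    // otherwise the message of the MalformedURLException.
    static String tryParse(String candidate) {
        try {
            // new URL(String) is deprecated in recent JDKs but still available;
            // it is the constructor that appears in the stack trace above.
            new URL(candidate);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A normal link: the "http" protocol has a handler, so this parses.
        System.out.println(tryParse("http://ulysses.wyona.org/"));

        // Assumed example of a namespace declaration taken for a link:
        // everything before the first ':' is treated as the protocol.
        System.out.println(tryParse("xmlns:xhtml=\"http://www.w3.org/1999/xhtml\""));
    }
}
```

The second call fails because no protocol handler is registered for "xmlns", which would fit the guess that a namespace declaration is being treated as a link when XHTML is run through the plain text parser.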
