Hi Sami,

I'm not sure I agree that the entire set of mime types you list below should be removed from the default parse-plugins.xml mapping. If you look at the current mapping file, many of the types below would have no option for parsing them other than the TextParser. I think it makes a lot of sense to parse some of these documents with the TextParser because they are, in fact, text documents. A LaTeX document is a plain text document. text/css is essentially a plain text document. An rfc822 message, stripped of its headers, is indeed a plain text document.
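For reference, a mapping entry in parse-plugins.xml looks roughly like the sketch below. It is written from memory of the file's general shape, so please treat the element and attribute names (mimeType, plugin) and the plugin ids (parse-text, parse-html) as assumptions to check against the copy in your own conf directory.

```xml
<!-- Sketch only: verify element names and plugin ids against your own parse-plugins.xml -->
<parse-plugins>
  <!-- A type that is plain text in practice, so the text parser is a reasonable choice -->
  <mimeType name="text/css">
    <plugin id="parse-text" />
  </mimeType>

  <!-- A possible local override: hand XHTML to the HTML parser instead -->
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>
</parse-plugins>
```

A mapping on its own may not be enough: as the ParserFactory warning quoted further down shows, the plugin's own plugin.xml is also expected to claim the content type.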
There's a careful tradeoff to be made between a default config file that gives the greatest coverage of the available mime types, handling each of them with at least *one* parser, and not including any parser at all for a particular mime type. I struggled with this very issue when I initially created that file, and what you see in there now represents a "best guess" at mapping mime types to the parsers that exist in Nutch. The other option is that people can modify the file on their own. For instance, in a domain-specific deployment, a user can add or remove whatever mime-type-to-plugin mappings she wants in parse-plugins.xml: it was never meant to be "set in stone". It would be good to see some experiments to find out what the best default configuration for parse-plugins.xml is.

Thanks!

Cheers,
Chris

On 8/27/06 12:30 AM, "sami siren" <[EMAIL PROTECTED]> wrote:

> This is yet another side effect of applying TextParser to non plain text
> documents and in this particular case it comes short with namespace
> declarations. I propose that we remove the PlainText parser from at least
> the following mime types:
>
> * (default)
> application/rss+xml
> application/vnd.wap.wbxml
> application/vnd.wap.wmlc
> application/vnd.wap.wmlscriptc
> application/xhtml+xml
> application/x-latex
> application/x-netcdf
> application/x-tex
> application/x-texinfo
> application/x-troff
> application/x-troff-man
> application/x-troff-me
> application/x-troff-ms
> message/news
> message/rfc822
> text/css
> text/sgml
> text/vnd.wap.wml
> text/xml
> text/x-setext
>
> I would guess that handling of text/xhtml+xml mimetype should be done with
> html parser anyway.
>
> --
> Sami Siren
>
> 2006/8/25, Michael Wechner <[EMAIL PROTECTED]>:
>>
>> I think the problem is as follows with XHTML files:
>>
>> 2006-08-25 16:06:11,925 WARN parse.ParserFactory -
>> ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
>> contentType application/xhtml+xml via parse-plugins.xml, but its
>> plugin.xml file does not claim to support contentType:
>> application/xhtml+xml
>> 2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks
>> java.net.MalformedURLException: unknown protocol: xmlns
>>         at java.net.URL.<init>(URL.java:544)
>>         at java.net.URL.<init>(URL.java:434)
>>         at java.net.URL.<init>(URL.java:383)
>>
>>
>> whereas maybe this could be resolved with
>>
>> http://issues.apache.org/jira/browse/NUTCH-359
>>
>> I am kind of surprised that nobody else is having this problem with
>> proper XHTML ;-)
>>
>> Thanks
>>
>> Michi
>>
>> Ken Gregoire wrote:
>>
>>> look here, it is blocking robots: http://ulysses.wyona.org/robots.txt
>>>
>>> User-agent: *
>>> Disallow: /foo/bar.html
>>>
>>> User-agent: lenya
>>> Disallow: /foo/bar.html
>>>
>>>
>>> Michael Wechner wrote:
>>>
>>>> Hi
>>>>
>>>> I am trying to index http://ulysses.wyona.org/ but somehow it just
>>>> indexes the homepage but doesn't seem to follow
>>>> any links. I have set "depth 3" and other sites are being crawled
>>>> deeper without a problem but not the Ulysses page.
>>>>
>>>> Has anyone made similar experiences?
>>>>
>>>> Is it possible that Nutch has problem with well-formed XHTML
>>>> (application/xhtml+xml)?
>>>>
>>>> Thanks
>>>>
>>>> Michi
>>>>
>>>
>>
>>
>> --
>> Michael Wechner
>> Wyona - Open Source Content Management - Apache Lenya
>> http://www.wyona.com http://lenya.apache.org
>> [EMAIL PROTECTED] [EMAIL PROTECTED]
>> +41 44 272 91 61
>>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
