[ http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12415116 ]
Jerome Charron commented on NUTCH-275:
--------------------------------------

> could it be extended to differentiate xml and xhtml

Yes, I have a new version based on the freedesktop specification that has been sitting on my disk for a while.
I don't want to commit it before the 0.8 release... probably for 0.9.
This version has better handling of xml / xhtml / html related documents.

For now, I think the best solution is to remove the magic detection for xml, simply by removing the <magic offset="0" ... > line for the xml content type in mime-types.xml.

> Fetcher not parsing XHTML-pages at all
> --------------------------------------
>
>          Key: NUTCH-275
>          URL: http://issues.apache.org/jira/browse/NUTCH-275
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>  Environment: problem with nightly-2006-05-20; worked fine with same website on 0.7.2
>     Reporter: Stefan Neufeind
>
> Server reports page as "text/html" - so I thought it would be processed as html.
> But something, I guess, evaluated the headers of the document and re-labeled it as "text/xml" (why not text/xhtml?).
> For some reason there is no plugin to be found for indexing text/xml (why does TextParser not feel responsible?).
> Links inside this document are NOT indexed at all - so crawling this website actually stops here.
> Funny thing: For some magical reason the dtd-files referenced in the header seem to be valid links for the fetcher and as such are indexed in the next round (if urlfilter allows).
>
> 060521 025018 fetching http://www.secreturl.something/
> 060521 025018 http.proxy.host = null
> 060521 025018 http.proxy.port = 8080
> 060521 025018 http.timeout = 10000
> 060521 025018 http.content.limit = 65536
> 060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; [email protected])
> 060521 025018 fetcher.server.delay = 1000
> 060521 025018 http.max.delays = 1000
> 060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
> 060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
> 060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060521 025019  map 0%  reduce 0%
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
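
To illustrate why removing the xml magic rule would change the outcome, here is a minimal sketch of offset-0 magic detection. It is not Nutch's actual MIME code: the class `MagicSketch`, the methods `rules` and `detect`, the rule table, and the sample document are all hypothetical, and it assumes that a matching magic rule overrides the server-reported type, which is what the report above suggests but does not confirm.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Nutch.
class MagicSketch {

    // Builds a toy rule table: bytes expected at offset 0 -> content type.
    // Passing false mimics deleting the xml <magic offset="0" ... > line.
    static Map<String, String> rules(boolean withXmlMagic) {
        Map<String, String> rules = new LinkedHashMap<>();
        if (withXmlMagic) {
            rules.put("<?xml", "text/xml");
        }
        rules.put("%PDF-", "application/pdf");
        return rules;
    }

    // Returns the type of the first matching magic rule, otherwise the
    // type reported by the server (assumed fallback behaviour).
    static String detect(byte[] content, String serverType, Map<String, String> rules) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            byte[] magic = rule.getKey().getBytes(StandardCharsets.US_ASCII);
            if (startsWith(content, magic)) {
                return rule.getValue();
            }
        }
        return serverType;
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // An XHTML page that starts with an XML prolog, served as text/html.
        byte[] xhtml = "<?xml version=\"1.0\"?>\n<html xmlns=\"http://www.w3.org/1999/xhtml\"></html>"
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println(detect(xhtml, "text/html", rules(true)));  // prints text/xml
        System.out.println(detect(xhtml, "text/html", rules(false))); // prints text/html
    }
}
```

With the xml rule present, the XML prolog at offset 0 wins and the page is labeled text/xml, for which no enabled parser is available in the log above. Without the rule, the page keeps the server's text/html and would go to the HTML parser.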
