OK, I finally found the cause:
- even though the Content-Type was "text/html", Nutch guessed "text/xml" because of the ".xml" file extension
- and parse-plugins.xml was mapping mimeType "text/xml" to parse-text (now parse-html, as in patch NUTCH-418)
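
For anyone hitting the same thing, the changed mapping in conf/parse-plugins.xml would look roughly like the fragment below. This is only a sketch, not a copy of the NUTCH-418 patch: the plugin ids other than parse-html and the alias entry are assumptions based on the parsers named in the hadoop.log warnings, and the rest of the file is left out.

```xml
<!-- conf/parse-plugins.xml (fragment, assumed layout) -->
<parse-plugins>

  <!-- Plugins are tried in the order listed, so parse-html now comes
       first for text/xml instead of parse-text. -->
  <mimeType name="text/xml">
    <plugin id="parse-html" />
    <plugin id="parse-rss" />
  </mimeType>

  <!-- Assumed alias: maps the plugin id to the parser class that
       appears in the log output. -->
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>

</parse-plugins>
```

With a mapping like this, the "does not claim to support contentType: text/xml" warnings can still show up, since they come from each plugin's own plugin.xml rather than from this file.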
So that solved my problem. Is there any danger in using parse-html to parse XHTML content (since I didn't see a specific XHTML parser)?

cybercouf wrote:
> 
> I saw the jira report about this problem (bug NUTCH-275), and applied the
> same configuration, but it's still not working.
> 
> mime-types.xml
> ---------------------
> <mime-type name="text/xml"
>            description="Extensible Markup Language File">
>     <ext>xml</ext><ext>xsl</ext>
>     <!--magic offset="0" value="<?xml"/-->
> </mime-type>
> 
> nutch-default.xml
> ------------------------
> <name>mime.type.magic</name>
> <value>false</value>
> 
> nutch-site.xml
> --------------------
> <name>mime.type.magic</name>
> <value>false</value>
> [...]
> <name>plugin.includes</name>
> <value>parse-(text|html|rss [...]
> 
> 
> the target webpage is like:
> <?xml version="1.0"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html>
> <head>
> 
> so nutch parse it using parse-text plugin, so no outlinks...
> 
> hadoop.log
> ----------------
> 2007-03-05 18:10:41,671 WARN parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,671 WARN parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,671 WARN parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,702 DEBUG parse.ParseUtil - Parsing
> [http://bmw.mobi/bmw/mobi/handler/0/nn/idx.xml] with
> [EMAIL PROTECTED]
> 2007-03-05 18:10:41,734 ERROR parse.OutlinkExtractor - getOutlinks
> java.net.MalformedURLException: unknown protocol: font-family
> 
> and in the segment dump i can see:
> Outlinks: 1
>   outlink: toUrl: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
> anchor:
> 
> 
> reading the jira report the bug should be fixed, so what's wrong with me?
> 

-- 
View this message in context: http://www.nabble.com/Nutch-0.8.1-not-parsing-XHTML-using-XML-%28even-mime.type.magic-off%29-tf3350710.html#a9335478
Sent from the Nutch - User mailing list archive at Nabble.com.
