I saw the JIRA report about this problem (NUTCH-275) and applied the
same configuration, but it's still not working.

mime-types.xml
---------------------
    <mime-type name="text/xml"
               description="Extensible Markup Language File">
        <ext>xml</ext><ext>xsl</ext>
        <!--magic offset="0" value="&lt;?xml"/-->
    </mime-type>

nutch-default.xml
------------------------
<name>mime.type.magic</name>
  <value>false</value>

nutch-site.xml
--------------------
<name>mime.type.magic</name>
  <value>false</value>
 [...]
<name>plugin.includes</name>
    <value>parse-(text|html|rss [...]


The target web page starts like this:
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html>
  <head>

So Nutch parses it with the parse-text plugin, and the page's outlinks are not extracted.
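
To make clear what I would expect, here is a rough sketch in Python (not Nutch code; the `guess_type` helper and the type names it returns are my own illustration, not anything from the Nutch source). It tells XHTML apart from generic XML by looking at the DOCTYPE rather than only at the `<?xml` prefix:

```python
import re

# Hypothetical helper, not part of Nutch: guess whether a page that starts
# with an XML declaration is really XHTML by inspecting its DOCTYPE.
XML_DECL = re.compile(r'^\s*<\?xml\b')
XHTML_DOCTYPE = re.compile(
    r'<!DOCTYPE\s+html\s+PUBLIC\s+"-//W3C//DTD XHTML', re.IGNORECASE
)


def guess_type(head: str) -> str:
    """Return a rough content type for the first characters of a document."""
    # Check the DOCTYPE first, so an XHTML page with an XML prolog
    # is not classified as plain XML.
    if XHTML_DOCTYPE.search(head):
        return "application/xhtml+xml"
    if XML_DECL.match(head):
        return "text/xml"
    if "<html" in head.lower():
        return "text/html"
    return "text/plain"


# The beginning of the page from this report.
page_head = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html>\n'
)

if __name__ == "__main__":
    print(guess_type(page_head))
```

With a check like this the page above would be treated as XHTML and could go to the HTML parser, which is the behaviour I was hoping for.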

hadoop.log
----------------
2007-03-05 18:10:41,671 WARN  parse.ParserFactory - ParserFactory:Plugin:
org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via
parse-plugins.xml, but its plugin.xml file does not claim to support
contentType: text/xml
2007-03-05 18:10:41,671 WARN  parse.ParserFactory - ParserFactory:Plugin:
org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via
parse-plugins.xml, but its plugin.xml file does not claim to support
contentType: text/xml
2007-03-05 18:10:41,671 WARN  parse.ParserFactory - ParserFactory:Plugin:
org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via
parse-plugins.xml, but its plugin.xml file does not claim to support
contentType: text/xml
2007-03-05 18:10:41,702 DEBUG parse.ParseUtil - Parsing
[http://bmw.mobi/bmw/mobi/handler/0/nn/idx.xml] with
[EMAIL PROTECTED]
2007-03-05 18:10:41,734 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: font-family

And in the segment dump I can see:
Outlinks: 1
  outlink: toUrl: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
anchor: 


According to the JIRA report the bug should be fixed, so what am I doing wrong?

