OK, I finally found the cause:

- even though the Content-Type was "text/html", Nutch guessed "text/xml" because
of the ".xml" file extension
- and parse-plugins.xml was mapping mimeType "text/xml" to parse-text (now
parse-html, as in the NUTCH-418 patch)

So my problem is solved. Is there any danger in using parse-html to parse XHTML
content (since I didn't see a specific XHTML parser)?
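
For anyone hitting the same issue, the parse-plugins.xml change would look roughly like the fragment below. This is only a sketch, assuming the stock Nutch 0.8.x layout of that file; the exact plugin list and order in the NUTCH-418 patch may differ. The plugin ids are taken from the three parsers named in the hadoop.log warnings quoted below.

```xml
<!-- conf/parse-plugins.xml (fragment, assumed layout) -->
<!-- Plugins are tried in the order listed, so putting parse-html first
     means XHTML served as text/xml goes to the HTML parser rather than
     to parse-text, which does not extract real outlinks. -->
<mimeType name="text/xml">
  <plugin id="parse-html" />
  <plugin id="parse-rss" />
  <plugin id="parse-text" />
</mimeType>
```

The "does not claim to support contentType: text/xml" warnings may still show up in hadoop.log with this mapping, since they come from each plugin's own plugin.xml rather than from parse-plugins.xml.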



cybercouf wrote:
> 
> I saw the JIRA report about this problem (bug NUTCH-275) and applied the
> same configuration, but it's still not working.
> 
> mime-types.xml
> ---------------------
>     <mime-type name="text/xml"
>                description="Extensible Markup Language File">
>         <ext>xml</ext><ext>xsl</ext>
>         <!--magic offset="0" value="&lt;?xml"/-->
>     </mime-type>
> 
> nutch-default.xml
> ------------------------
> <name>mime.type.magic</name>
>   <value>false</value>
> 
> nutch-site.xml
> --------------------
> <name>mime.type.magic</name>
>   <value>false</value>
>  [...]
> <name>plugin.includes</name>
>     <value>parse-(text|html|rss [...]
> 
> 
> The target web page starts like this:
> <?xml version="1.0"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html>
>   <head>
> 
> So Nutch parses it with the parse-text plugin, and no real outlinks are extracted...
> 
> hadoop.log
> ----------------
> 2007-03-05 18:10:41,671 WARN  parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,671 WARN  parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,671 WARN  parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,702 DEBUG parse.ParseUtil - Parsing
> [http://bmw.mobi/bmw/mobi/handler/0/nn/idx.xml] with
> [EMAIL PROTECTED]
> 2007-03-05 18:10:41,734 ERROR parse.OutlinkExtractor - getOutlinks
> java.net.MalformedURLException: unknown protocol: font-family
> 
> And in the segment dump I can see:
> Outlinks: 1
>   outlink: toUrl: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
> anchor: 
> 
> 
> According to the JIRA report the bug should be fixed, so what am I doing wrong?
> 

-- 
View this message in context: 
http://www.nabble.com/Nutch-0.8.1-not-parsing-XHTML-using-XML-%28even-mime.type.magic-off%29-tf3350710.html#a9335478
Sent from the Nutch - User mailing list archive at Nabble.com.

