Jérôme,

>>              Why should Nutch treat it as HTML? 
>
>       Simply because it is a HTML file, with a strange name, of course, but 
> it is a HTML file.
>       My example is a kind of "caricature". But some more real case could be 
> : a HTML file with a text/plain content-type, or with an text/xml 

These cases don't sound "real" to me either.  
In the first case (text/plain), the page would be displayed with all HTML tags 
visible; only very patient readers would try to decipher it.
In the second case (text/xml), the document would most likely not be displayed 
at all, because most HTML documents are not well-formed XML.  
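
To illustrate the text/xml case, here is a minimal sketch in Python (not Nutch 
code). It uses the standard library's strict XML parser; the helper name 
is_well_formed_xml and the two sample pages are made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text: str) -> bool:
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Typical hand-written HTML: unclosed <br> and <p>, unquoted attribute value.
html_page = "<html><body><p>Hello<br>world<img src=logo.gif></body></html>"

# The same content rewritten so that it is well-formed XML.
xhtml_page = '<html><body><p>Hello<br/>world<img src="logo.gif"/></p></body></html>'

print(is_well_formed_xml(html_page))   # False
print(is_well_formed_xml(xhtml_page))  # True
```

A client that honors the text/xml header and parses strictly would reject the 
first page in the same way.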

The site admins, not Nutch, must fix this inconsistency; I don't think Nutch 
needs to be "smarter" than browsers.
It's actually better for Nutch to miss these pages. I don't want to see a hit 
that leads me to a page that cannot be viewed.

-kuro



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
