Jérôme,

>> Why should Nutch treat it as HTML?
>
> Simply because it is a HTML file, with a strange name, of course, but
> it is a HTML file.
> My example is a kind of "caricature". But some more real case could be
> : a HTML file with a text/plain content-type, or with an text/xml

These cases don't sound "real" to me either. In the first case (text/plain), the page would be displayed with all HTML tags visible; only very patient readers would try to decipher it. In the second case (text/xml), the document would most likely not be displayed at all, because most HTML documents are not well-formed XML.

The site admins, not Nutch, must fix this inconsistency; I don't think Nutch needs to be "smarter" than browsers. It's actually better for Nutch to miss these pages. I don't want to see a hit that leads me to a page that cannot be viewed.

-kuro

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
