> > I have a good idea of how to handle that situation.
> > If there are multiple and conflicting values for
> > important meta-data such as the content-type, the page
> > is horribly broken, and Nutch shouldn't waste effort
> > trying to figure out what's going on. For example, if
> [..]
>
> I understand your position, and respectfully disagree. I could give you
> a lot of examples of horribly broken servers (among others some versions
> of MS IIS), and horribly broken pages that don't follow any standards -
> which nonetheless contain valuable content, and Nutch should be able to
> crawl such sites too.

If a search engine is strictly compliant with standards (HTML and HTTP,
for instance), it will miss a lot of documents and information. One of
the major difficulties for search engines (and browsers too) is to be
permissive and to use heuristics to detect the right meta-data
(content-type, language, and so on).

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
