> > I have a good idea of how to handle that situation.
> > If there are multiple and conflicting values for
> > important meta-data such as the content-type, the page
> > is horribly broken, and Nutch shouldn't waste effort
> > trying to figure out what's going on. For example, if
> [..]
> 
> I understand your position, and respectfully disagree. I could give you
> a lot of examples of horribly broken servers (among others some versions
> of MS IIS), and horribly broken pages that don't follow any standards -
> which nonetheless contain valuable content, and Nutch should be able to
> crawl such sites too.

If a search engine is strictly compliant with standards (HTML and HTTP, for 
instance), it will miss a lot of documents and information.
One of the major difficulties for search engines (and browsers too) is to 
be permissive and to have heuristics that detect the right 
meta-data (content-type, language, and so on).
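
To make the idea concrete, here is a rough sketch of such a heuristic. It is 
not Nutch code: the function name guess_content_type, the priority order 
(sniffed markup, then the <meta> tag, then the HTTP header) and the fallback 
type are all assumptions chosen for illustration.

```python
import re

# Hypothetical helper, not part of Nutch: pick a content type from
# conflicting hints instead of rejecting the page outright.
_META_RE = re.compile(
    r'<meta[^>]+http-equiv=["\']?content-type["\']?[^>]*'
    r'content=["\']?([^"\'>;\s]+)',
    re.IGNORECASE,
)


def guess_content_type(header_value, body):
    """Return a best-guess media type for a fetched page.

    Assumed priority: markup sniffed from the body, then a <meta>
    http-equiv declaration, then the HTTP Content-Type header.
    """
    # 1. Sniff the start of the body for obvious HTML markup.
    head = body[:1024].lstrip().lower()
    if head.startswith("<!doctype html") or "<html" in head:
        return "text/html"

    # 2. Look for a <meta http-equiv="Content-Type"> declaration.
    match = _META_RE.search(body[:4096])
    if match:
        return match.group(1).lower()

    # 3. Trust the HTTP header, dropping parameters such as charset.
    if header_value:
        media_type = header_value.split(";")[0].strip().lower()
        if media_type:
            return media_type

    # 4. Nothing usable: fall back to a generic binary type.
    return "application/octet-stream"
```

With this ordering, a server that sends "text/plain" for a page whose body 
starts with <html> is still treated as text/html, so the content gets 
parsed rather than dropped.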

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
