[ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414464 ]
Stefan Neufeind commented on NUTCH-286:
---------------------------------------

Well, we _could_ close it, though the question still remains for me. The problem, imho, is that you say it's hard to do. Sure, you could always write searches to prune those pages from the index, but I wonder whether that's a clean solution or whether it would be better to have a way of excluding certain pages (like these common error pages, even though their status header is wrong).

I guess it's the typical problem when crawling the web: a technician will say "that webserver/typo3 is wrong and is to be fixed", but management will not care, and you will have to solve the problem in whatever way.

> Handling common error-pages as 404
> ----------------------------------
>
>          Key: NUTCH-286
>          URL: http://issues.apache.org/jira/browse/NUTCH-286
>      Project: Nutch
>         Type: Improvement
>     Reporter: Stefan Neufeind
>
> Idea: Some pages from some software-packages/scripts report an "http 200 ok"
> even though a specific page could not be found. Example I just found is:
> http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
> That's a typo3-page explaining in its standard layout and wording: "The
> requested page did not exist or was inaccessible."
> So I had the idea that somebody might create a plugin that could find commonly
> used formulations for "page does not exist" etc. and turn the page into a 404
> before feeding it into the nutch-index, even though the server responded
> with status 200 ok.
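To make the idea in the issue concrete, here is a minimal sketch of the phrase check such a plugin could perform. It is plain Java and deliberately does not touch any Nutch plugin interface; the class name SoftNotFoundDetector, its methods, and every phrase in the list except the typo3 wording quoted above are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: decides whether a page that was
// served with "200 OK" should be treated as a 404 because of its wording.
final class SoftNotFoundDetector {

    // Lower-case phrases that suggest an error page. Only the first entry
    // comes from the issue text; the others are illustrative assumptions.
    private static final List<String> PHRASES = Arrays.asList(
            "the requested page did not exist or was inaccessible",
            "page not found",
            "404 not found");

    private SoftNotFoundDetector() {
    }

    // Returns true if the page text contains one of the known phrases.
    static boolean looksLikeNotFound(String pageText) {
        if (pageText == null) {
            return false;
        }
        // Normalise case and whitespace so that line breaks or extra
        // spaces inside a phrase do not defeat the match.
        String normalised = pageText.toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ");
        for (String phrase : PHRASES) {
            if (normalised.contains(phrase)) {
                return true;
            }
        }
        return false;
    }

    // Maps a 200 response with error-page wording to 404; every other
    // status code is passed through unchanged.
    static int effectiveStatus(int httpStatus, String pageText) {
        if (httpStatus == 200 && looksLikeNotFound(pageText)) {
            return 404;
        }
        return httpStatus;
    }
}
```

A crawler-side filter could call effectiveStatus(status, text) before indexing and skip anything that comes back as 404. Plain substring matching will produce false positives on pages that merely talk about error pages, so the phrase list would need to be configurable and kept conservative.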
