[ http://issues.apache.org/jira/browse/NUTCH-286?page=all ]
Stefan Groschupf closed NUTCH-286:
----------------------------------
Resolution: Won't Fix
I hope everybody agrees with the statement: we cannot detect HTTP response
codes based on the returned HTML content.
Pruning the index is a good way to solve the problem.
> Handling common error-pages as 404
> ----------------------------------
>
> Key: NUTCH-286
> URL: http://issues.apache.org/jira/browse/NUTCH-286
> Project: Nutch
> Type: Improvement
> Reporter: Stefan Neufeind
>
> Idea: Some pages from some software packages/scripts report an "HTTP 200 OK"
> even though a specific page could not be found. An example I just found is:
> http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
> That's a TYPO3 page explaining in its standard layout and wording: "The
> requested page did not exist or was inaccessible."
> So I had the idea that somebody might create a plugin that could find commonly
> used formulations for "page does not exist" etc. and turn the page into a 404
> before feeding it into the Nutch index, even though the server responded
> with status 200 OK.
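
For illustration only, the phrase matching proposed in the quoted description could be sketched roughly as below. This is a standalone sketch, not Nutch code: the class name SoftNotFoundDetector, the method looksLikeErrorPage and the phrase list are all hypothetical, and nothing here uses the Nutch plugin API. It is a heuristic and cannot recover the real HTTP status; a legitimate page that merely quotes one of these phrases would be flagged as well.

```java
import java.util.Locale;

/** Hypothetical helper (not part of Nutch): flags text that looks like an error page. */
final class SoftNotFoundDetector {

    // Illustrative phrases only (assumption); a real list would need tuning per site and language.
    private static final String[] PHRASES = {
        "the requested page did not exist or was inaccessible",
        "page not found",
        "page does not exist"
    };

    private SoftNotFoundDetector() {}

    /** Returns true if the extracted page text contains one of the known phrases. */
    static boolean looksLikeErrorPage(String pageText) {
        if (pageText == null) {
            return false;
        }
        // Lower-case and collapse whitespace so line breaks in the text do not hide a match.
        String normalized = pageText.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        for (String phrase : PHRASES) {
            if (normalized.contains(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

A caller would pass the already extracted plain text of a fetched page and could use the result to pick candidates for pruning from the index, rather than to rewrite the response status.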
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers