[ http://issues.apache.org/jira/browse/NUTCH-286?page=all ]
     
Stefan Groschupf closed NUTCH-286:
----------------------------------

    Resolution: Won't Fix

I hope everybody agrees with the statement: we cannot detect HTTP response 
codes based on the HTML content of the response.
Pruning the index is a good way to solve the problem.

> Handling common error-pages as 404
> ----------------------------------
>
>          Key: NUTCH-286
>          URL: http://issues.apache.org/jira/browse/NUTCH-286
>      Project: Nutch
>         Type: Improvement

>     Reporter: Stefan Neufeind

>
> Idea: Some software packages/scripts return an "HTTP 200 OK" even though 
> the requested page could not be found. An example I just found is:
> http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
> That's a TYPO3 page explaining, in its standard layout and wording: "The 
> requested page did not exist or was inaccessible."
> So I had the idea that somebody might create a plugin that could find 
> commonly used formulations for "page does not exist" etc. and turn such 
> pages into a 404 before feeding them into the Nutch index, although the 
> server responded with status 200 OK.
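For illustration only, the phrase-matching idea described in the report could be sketched as below. This is not Nutch code and not an existing plugin: the names SOFT_404_PHRASES, visible_text and effective_status are hypothetical, and only the first phrase in the list is taken from the report; the others are assumed examples.

```python
import re

# Hypothetical phrase list. Only the first entry comes from the report above;
# the remaining entries are assumed examples of "page does not exist" wording.
SOFT_404_PHRASES = [
    "the requested page did not exist or was inaccessible",
    "page not found",
    "page does not exist",
]

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def visible_text(html: str) -> str:
    """Crudely strip tags, collapse whitespace and lowercase the result.

    This is a regex shortcut for the sketch, not a real HTML parser.
    """
    without_tags = _TAG_RE.sub(" ", html)
    return _WS_RE.sub(" ", without_tags).strip().lower()


def effective_status(status: int, html: str) -> int:
    """Return 404 for a 200 response whose text contains a known error phrase.

    Any status other than 200 is passed through unchanged.
    """
    if status != 200:
        return status
    text = visible_text(html)
    if any(phrase in text for phrase in SOFT_404_PHRASES):
        return 404
    return status
```

A check like this is a heuristic at best: a page that merely quotes one of these phrases (for example, documentation about error pages) would be misclassified, which is in line with the objection that response codes cannot be reliably derived from the returned HTML.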

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
