[Nutch-dev] [jira] Commented: (NUTCH-286) Handling common error-pages as 404

Stefan Neufeind (JIRA) Fri, 02 Jun 2006 09:25:12 -0700

    [ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414464 ]


Stefan Neufeind commented on NUTCH-286:
---------------------------------------

Well, we _could_ close it, though the question still remains for me. The 
problem, imho, is that you say it's hard to do.
Sure, you could always write searches to prune those pages from the index, 
but I wonder whether that's a clean solution or whether it would be better to 
have a way of excluding certain pages (like these common error pages, even 
though their status header is wrong). I guess it's the typical problem when 
crawling the web: a technician will say "that webserver/typo3 is wrong and has 
to be fixed", but management will not care, and you will have to solve the 
problem in whatever way you can.

> Handling common error-pages as 404
> ----------------------------------
>
>          Key: NUTCH-286
>          URL: http://issues.apache.org/jira/browse/NUTCH-286
>      Project: Nutch
>         Type: Improvement
>     Reporter: Stefan Neufeind
>
> Idea: Some pages from some software packages/scripts report an "HTTP 200 OK" 
> even though a specific page could not be found. An example I just found is:
> http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
> That's a typo3 page explaining, in its standard layout and wording: "The 
> requested page did not exist or was inaccessible."
> So my idea is that somebody might create a plugin that could find commonly 
> used formulations for "page does not exist" etc. and turn such pages into a 
> 404 before feeding them into the Nutch index, although the server responded 
> with status 200 OK.
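
The idea in the issue description could look roughly like the sketch below. It is
plain Python, not a Nutch plugin and not Nutch's API; the function name
`effective_status` and the phrase list are hypothetical, and only the typo3
wording is taken from the report. It just shows the matching step: if a response
with status 200 contains a known "page does not exist" formulation, treat it as
a 404.

```python
import re

# Hypothetical phrase list. Only the first entry (the typo3 wording) comes from
# the issue report; the others are illustrative assumptions and would need
# tuning per site or CMS.
SOFT_404_PATTERNS = [
    re.compile(r"requested page did not exist or was inaccessible", re.IGNORECASE),
    re.compile(r"page (was )?not found", re.IGNORECASE),
    re.compile(r"page does not exist", re.IGNORECASE),
]


def effective_status(status_code: int, body_text: str) -> int:
    """Return 404 if a 200 response looks like an error page, else the original status."""
    if status_code != 200:
        # Only responses that claim success are candidates for reclassification.
        return status_code
    for pattern in SOFT_404_PATTERNS:
        if pattern.search(body_text):
            return 404
    return status_code


if __name__ == "__main__":
    print(effective_status(200, "The requested page did not exist or was inaccessible."))
    print(effective_status(200, "Welcome to our company page."))
```

A plain phrase match like this will also flag legitimate pages that merely quote
such wording (a bug report about error pages, for instance), so a real filter
would probably need extra signals such as page length or a per-site allow-list.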
