This is the behaviour I am seeing with pages that return a server
redirect (a 3xx status code):
Say page A redirects to page B. A is in the fetchlist created by
generate. When A is fetched, the redirect is followed and B is fetched.
At the next updatedb, both A and B go into the crawldb. For some reason,
B then appears in the fetchlist at the next generate, and again at the
generate after that, and so on.
An example is:
http://www.selecthotels.com
which redirects to http://203.210.113.143/ ('page B').
That page always seems to be in the fetchlist, no matter how many times
it gets fetched. (To make matters more complicated, it in turn redirects
to yet another URL.)
How do I fix this behaviour?
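In case it helps to see what I have been considering: the only lever I
have found so far is the http.redirect.max property (I am going by the
description in nutch-default.xml, so the name and semantics are an
assumption on my part; please correct me if this version differs). As I
read it, setting it to 0 stops the fetcher following redirects inline
and instead records the target for a later round, which might let B get
a proper fetch status of its own. A sketch of the nutch-site.xml
override:

  <!-- Sketch only: do not follow redirects inside the fetcher; record
       the redirect target (page B) so it is fetched as its own entry.
       Assumes http.redirect.max exists with the semantics described in
       nutch-default.xml. -->
  <property>
    <name>http.redirect.max</name>
    <value>0</value>
  </property>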
Also, other URLs whose fetch fails for some reason stay in the crawldb
and are tried again and again. For a deep crawl with topN=1000, after a
number of rounds each generated fetchlist contains many hundreds of
these failed URLs that the fetcher tries to refetch.
How do I fix this behaviour too?
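For the retry side, the only knob I have spotted is db.fetch.retry.max,
which (again assuming I am reading nutch-default.xml correctly) caps how
many times a URL that hit a recoverable error is regenerated before it
is marked gone. A sketch of the override I was thinking of trying:

  <!-- Sketch only: retry failing URLs once instead of the default, so
       they stop crowding out live URLs in the generated fetchlists.
       Assumes db.fetch.retry.max exists with these semantics. -->
  <property>
    <name>db.fetch.retry.max</name>
    <value>1</value>
  </property>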
End of the day for me (NZST). I'll try again tomorrow....
Cheers,
Carl.