This is the behaviour I am noticing with pages that have a server redirect (a 3xx status code):

Say page A redirects to page B. A is in the fetchlist created by generate. When A is fetched, the redirect is followed and B is fetched as well. At the next updatedb, both A and B go into the crawldb. For some reason, B is then listed to be fetched again at the next generate, and again at the generate after that, and so on.


An example is:

http://www.selecthotels.com

which redirects to http://203.210.113.143/ ('page B').
This page always seems to be in the fetchlist, no matter how many times it has already been fetched. (To make matters more complicated, that page in turn redirects to yet another URL.)
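For reference, this is how I have been inspecting what the crawldb actually holds for these pages (a sketch, assuming the 0.8-style readdb tool; the -url option prints the stored CrawlDatum, i.e. status, fetch time, retry count and score, for a single URL):

    # check what the crawldb records for page A and page B
    # ("crawldb" here stands for the path to your crawldb directory)
    bin/nutch readdb crawldb -url http://www.selecthotels.com/
    bin/nutch readdb crawldb -url http://203.210.113.143/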

How do I fix this behaviour?

Also, other URLs whose fetch fails for some reason stay in the crawldb and are retried again and again. For a 'deep' crawl using topN=1000, after a number of runs each generated fetchlist contains many hundreds of these failed URLs that it tries to refetch.

How do I fix this behaviour too?
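In case it's relevant, these are the two properties I suspect are involved, though I haven't confirmed that either one actually governs this. An override in conf/nutch-site.xml would look roughly like this (values are guesses on my part):

    <!-- stop regenerating a URL after this many failed fetch attempts,
         instead of retrying it indefinitely (value is a guess) -->
    <property>
      <name>db.fetch.retry.max</name>
      <value>2</value>
    </property>

    <!-- follow redirects during the fetch itself rather than recording
         the target URL for a later fetch round (value is a guess) -->
    <property>
      <name>http.redirect.max</name>
      <value>3</value>
    </property>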



End of the day for me (NZST). I'll try again tomorrow....

Cheers,
Carl.
