There are two different kinds of redirect: permanent and temporary. When a web site returns a 301 status (Moved Permanently), it means "the URL you requested is no longer valid, don't ask for it again". When it returns a 307 status (Temporary Redirect), it means "keep asking for the URL you asked for, and I'll tell you where to go from there". In the first case, Nutch should remove the first URL from its database and put the redirection target in its place. In the second case, Nutch should leave the original URL in its database, but also go to the redirection target. I don't know if that's actually what Nutch does, but I assume so.
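To make the distinction concrete, here is a rough sketch of the behaviour I'd expect, using a plain Python set as a stand-in for the crawl database. The function name apply_redirect and the example.com URLs are made up for illustration; this is not Nutch's API, and I haven't checked it against what Nutch really does.

```python
# Hypothetical in-memory stand-in for a crawl database; not Nutch's CrawlDb API.
def apply_redirect(db, source, status, target):
    """Update the set of known URLs after `source` answered with a redirect."""
    if status == 301:
        # Permanent: forget the old URL and track the target in its place.
        db.discard(source)
        db.add(target)
    elif status == 307:
        # Temporary: keep asking for the original, but also visit the target.
        db.add(source)
        db.add(target)
    else:
        raise ValueError(f"status not handled in this sketch: {status}")
    return db


# Example URLs are placeholders.
db = {"http://example.com/old", "http://example.com/tmp"}
apply_redirect(db, "http://example.com/old", 301, "http://example.com/new")
apply_redirect(db, "http://example.com/tmp", 307, "http://example.com/elsewhere")
print(sorted(db))
```

After the 301 the old URL is gone and only its target remains; after the 307 both the original URL and its target are in the set.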

On Tue, Oct 27, 2009 at 11:30 AM, caezar <caeza...@gmail.com> wrote:
>
> Hi All,
>
> I've done some googling, but found different answers, so I would appreciate
> if you tell me which is the correct one:
> - when page redirected, content of target page is fetched and associated
> with the source (initial) page URL
> - when page redirected, new entry with the redirect target url and contents
> added to the db
>
> If the second option is the correct one, then one more question. When I have
> a NutchDocument instance which represents target URL, is that possible to
> retrieve it's redirect source URL somehow?
>
> Thanks
> --
> View this message in context:
> http://www.nabble.com/Redirect-handling-tp26079767p26079767.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>

--
http://www.linkedin.com/in/paultomblin