There are three kinds of "redirects". The first is where the server,
behind the scenes, forwards to a different page and returns that page's
output. This is usually called a forward. The second is where the server
sends a redirect status code (in the 300 range). The browser then
requests the page it was redirected to. This is usually called a
protocol redirect or just a redirect in JSP and ASP terms. The third is
where the page has a meta-refresh tag in its HTML head. This is known as
a content redirect or a meta redirect. Here the client doesn't get a
redirect code in the HTTP response but, after the delay given in the
tag, will request the page in the url section of the meta-refresh tag.
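To make the distinction concrete, here is a minimal sketch of what a client can actually observe for each of the three cases. It is not Nutch code: `classify`, the two regular expressions, and the plain-dict `headers` argument (with a canonical `Location` key) are all assumptions made up for illustration, and the regex is a rough stand-in for real HTML parsing.

```python
import re

# Hypothetical helper, not part of Nutch: classify what a crawler can
# observe from a single HTTP response.
META_REFRESH = re.compile(
    r'<meta[^>]*http-equiv\s*=\s*["\']?refresh["\']?[^>]*>',
    re.IGNORECASE,
)
CONTENT_URL = re.compile(r'url\s*=\s*["\']?([^"\'>\s;]+)', re.IGNORECASE)


def classify(status, headers, body):
    """Return (kind, target) for a response as seen by the client."""
    # Protocol redirect: 3xx status plus a Location header.
    if 300 <= status < 400 and "Location" in headers:
        return ("protocol", headers["Location"])
    # Content (meta) redirect: the page itself carries a meta-refresh tag.
    tag = META_REFRESH.search(body)
    if tag:
        url = CONTENT_URL.search(tag.group(0))
        if url:
            return ("meta", url.group(1))
    # A server-side forward is invisible: it looks like an ordinary page.
    return ("none", None)
```

Note that a forward falls into the last branch. Nothing in the response tells the client that a different page produced the content.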
If (www.domain.com/?code.asp&redirect=444) sends a forward, then Nutch
doesn't know anything about it and will just index the content returned
under the original url. If it sends a protocol redirect, then Nutch
requests the new page and indexes the new page under the new url. Nutch
will follow redirects up to http.redirect.max times, so if the redirect
page redirects again, Nutch will follow that one as well, up to the max.
If the url variable "redirect" is used to populate a meta-refresh tag,
then as of right now Nutch won't follow the redirect. I think it fails
with a NullPointerException right now.
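The http.redirect.max limit described above can be pictured as a bounded loop. The sketch below is only an illustration of that idea, not Nutch's fetcher: `resolve`, the `fetch` callable, the `PAGES` table, and the `loop.example` URL are hypothetical, and exactly how Nutch counts redirects against the limit is not something this sketch claims to reproduce.

```python
def resolve(url, fetch, redirect_max):
    """Follow protocol redirects from url, at most redirect_max times.

    fetch is a hypothetical callable returning (status, location, content).
    Returns (final_url, content), or (last_target, None) if the limit is hit.
    """
    for _ in range(redirect_max + 1):
        status, location, content = fetch(url)
        if 300 <= status < 400 and location:
            url = location  # the next request goes to the redirect target
            continue
        return url, content  # content is indexed under the URL that served it
    return url, None  # gave up: too many redirects


# Simulated responses standing in for real HTTP requests.
PAGES = {
    "http://www.domain.com/?code.asp&redirect=444": (
        302, "http://www.another-domain.com/", None),
    "http://www.another-domain.com/": (200, None, "new page"),
    "http://loop.example/a": (302, "http://loop.example/a", None),
}
```

Calling `resolve("http://www.domain.com/?code.asp&redirect=444", PAGES.get, 3)` ends on the www.another-domain.com entry, which is why the new url is the one that shows up in the index. The self-redirecting entry shows why a limit is needed at all.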
The meta-refresh was working in 0.7.2 but is broken in 0.8. Andrzej
Bialecki said he was looking into fixing it. Hope this helps you
understand what is happening with the fetch.
Dennis
Insurance Squared Inc. wrote:
Perhaps a point of clarification - I'm assuming that the
www.domain.com/?code.asp&redirect=444 actually sends a redirect header
to the new page. In that case (I don't know enough about protocols
personally to be sure) it seems that nutch would have to recognize
that it's being redirected and refetch at the new location. Am I
correct? And if so, wouldn't nutch then index and display the new,
redirected page?
I'm using version 0.7, btw.
thanks,
Glenn
Dennis Kubes wrote:
Protocol-level redirects (ASP redirects), meaning the server sends a
3xx redirect response code, work correctly in Nutch 0.8 dev. Nutch
processes the target as a completely new page. If you are doing ASP
forwards, I believe that the original page
(www.domain.com/?code.aspx&redirect=445454) would be the URL that
shows up in the search, because Nutch doesn't know what is going on
behind the scenes in the ASP code. It only knows the url and the
content received. As of right now in 0.8 dev, meta-level redirects
(meta-refresh tags) don't work correctly. They did in 0.7, but I don't
think that functionality has been ported.
Dennis
Insurance Squared Inc. wrote:
How are redirects listed in version 0.7? If the crawler finds a
link like:
www.domain.com/?code.aspx&redirect=445454
and that link redirects through to www.another-domain.com, which of
those two links will show up in nutch?
(I'm wondering if I can use nutch to crawl sites with a lot of
redirects, and still end up with the correct redirected domain in
the listings).
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general