Re: Get target URL of redirects

Lewis John Mcgibbney Sat, 21 Jan 2012 05:25:48 -0800

Hi Markus,

As I'm sure you are aware, there is quite a bit of literature on the rather
extensive topic of URL redirects. Hopefully by discussing it we can
clarify the answers to your specific question(s).

I think the main problem here is that there are numerous methods for
implementing redirects, most of which seem to communicate the redirect in
different ways.

On Thu, Jan 19, 2012 at 4:25 PM, Markus Jelsma
<[email protected]> wrote:

> Hi,
>
> Why is it so hard to get the target URL of a redirect? I have to get the
> protocolstatus out of the crawl datum's metadata and then get the first
> arg of ProtocolStatus' args?
>
This (I assume) would be the most convenient method for doing so. However,
as you mention, is the issue that the target URL is not always the 1st arg?

>
> Can it have more than 1 arg?

I think we would expect the HTTP protocol status line and the
Content-Length header (which the web server usually adds automatically)
to come before the target URL...
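
Since it is unclear at which position the target URL sits, one defensive
option is to scan the args for the first value that looks like an absolute
URL rather than relying on a fixed index. The following is only a rough
sketch of that idea in Python. It assumes the args are available as a plain
list of strings; `first_url_arg` is a hypothetical helper name and is not
part of Nutch's API.

```python
from urllib.parse import urlparse


def first_url_arg(args):
    """Return the first argument that parses as an absolute http(s) URL.

    `args` is assumed to be a list of strings (or None). Returns None
    when no argument looks like an absolute URL.
    """
    for arg in args or []:
        candidate = arg.strip()
        try:
            parsed = urlparse(candidate)
        except ValueError:
            # Malformed input (for example a broken IPv6 literal): skip it.
            continue
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return candidate
    return None
```

With this approach it does not matter whether the URL is the first, second
or third arg, at the cost of assuming that no other arg is itself a URL.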

> Is there a decent method to get the URL? At first
> I assumed the _repr_ key would return the target URL, but that key doesn't
> seem to exist for some test redirects I have.
>

What is the nature of the redirects and how are they specified? Manual
redirects, HTTP refresh header, etc?
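
To illustrate why the mechanism matters, here is a minimal sketch in Python
of how the target can end up in three different places: the Location header
of a 3xx response, a Refresh header, or a meta refresh tag in the HTML body.
It is independent of Nutch; the function name `find_redirect_target`, its
parameters and the returned labels are made up for this example, and the
regular expressions only cover the common forms (JavaScript redirects are
not handled at all).

```python
import re
from urllib.parse import urljoin

# Matches a refresh value such as "5; url=http://example.com/new".
_REFRESH_RE = re.compile(r"\s*\d+\s*[;,]\s*url\s*=\s*(.+)", re.IGNORECASE | re.DOTALL)
# Matches a whole <meta http-equiv="refresh" ...> tag.
_META_RE = re.compile(
    r"<meta\b[^>]*\bhttp-equiv\s*=\s*[\"']?refresh[\"']?[^>]*>", re.IGNORECASE
)
# Extracts the quoted value of the content attribute inside that tag.
_CONTENT_RE = re.compile(r"content\s*=\s*(?:\"([^\"]*)\"|'([^']*)')", re.IGNORECASE)


def _parse_refresh(value):
    """Return the URL part of a refresh value, or None if it has none."""
    match = _REFRESH_RE.match(value)
    if not match:
        return None
    return match.group(1).strip().strip("'\"") or None


def find_redirect_target(base_url, status, headers, body=""):
    """Return (mechanism, absolute_target_url), or None if no redirect is found."""
    lowered = {name.lower(): value for name, value in headers.items()}

    # 1. Plain HTTP redirect: 3xx status plus Location header.
    if 300 <= status < 400 and "location" in lowered:
        return ("http-status", urljoin(base_url, lowered["location"].strip()))

    # 2. Refresh response header.
    if "refresh" in lowered:
        target = _parse_refresh(lowered["refresh"])
        if target:
            return ("refresh-header", urljoin(base_url, target))

    # 3. Meta refresh tag inside the HTML body.
    meta = _META_RE.search(body)
    if meta:
        content = _CONTENT_RE.search(meta.group(0))
        if content:
            value = content.group(1) if content.group(1) is not None else content.group(2)
            target = _parse_refresh(value)
            if target:
                return ("meta-refresh", urljoin(base_url, target))

    return None
```

For example, `find_redirect_target("http://example.com/a", 301, {"Location": "/b"})`
returns `("http-status", "http://example.com/b")`. Knowing which of these
cases your test redirects fall into would help narrow down where the target
URL ends up.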

-- 
*Lewis*
