Re: Get target URL of redirects

Markus Jelsma Mon, 30 Jan 2012 05:36:53 -0800

Hi,

This is not about how Nutch behaves with redirects or how its state is
implemented, just about why retrieving the redirect target from a CrawlDatum
via its metadata and then the ProtocolStatus arguments is so tedious.
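
For illustration, the lookup chain discussed in this thread (crawl datum
metadata, then protocol status, then its first argument) can be sketched
roughly as follows. This is a minimal, self-contained sketch and not Nutch
code: the Status class, the "_pst_" key, and the MOVED/TEMP_MOVED values are
stand-ins assumed for the example and should be checked against the
ProtocolStatus and Nutch constants of the version in use. It also assumes the
redirect target, when present, is the first argument, which is exactly the
point in question here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class RedirectTargetSketch {

    // Hypothetical stand-in for a protocol status: a code plus string arguments.
    static final class Status {
        final int code;
        final String[] args;

        Status(int code, String... args) {
            this.code = code;
            this.args = args;
        }
    }

    // Assumed values for this sketch only; verify against your Nutch version.
    static final int MOVED = 12;
    static final int TEMP_MOVED = 13;
    static final String PROTO_STATUS_KEY = "_pst_";

    // Returns the first status argument if the stored status is a redirect,
    // or an empty Optional if there is no status, no redirect, or no argument.
    static Optional<String> redirectTarget(Map<String, Object> metadata) {
        Object value = metadata.get(PROTO_STATUS_KEY);
        if (!(value instanceof Status)) {
            return Optional.empty();
        }
        Status status = (Status) value;
        boolean isRedirect = status.code == MOVED || status.code == TEMP_MOVED;
        if (!isRedirect || status.args == null || status.args.length == 0) {
            return Optional.empty();
        }
        return Optional.of(status.args[0]);
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new HashMap<>();
        metadata.put(PROTO_STATUS_KEY, new Status(MOVED, "http://example.org/new"));
        System.out.println(redirectTarget(metadata).orElse("(no redirect target)"));
    }
}
```

Returning an Optional keeps the "key missing", "not a redirect", and "no
arguments" cases from turning into null checks at every call site.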


Cheers,


On Saturday 21 January 2012 14:25:17 Lewis John Mcgibbney wrote:
> Hi Markus,
> 
> As I'm sure you are aware, there is quite a bit of literature on the rather
> extensive topic of URL redirects; hopefully by discussing it we can
> clarify things with respect to your specific question(s).
> 
> I think the main problem here is that there are numerous methods for
> implementing redirects, most of which seem to communicate the redirect in
> different ways.
> 
> On Thu, Jan 19, 2012 at 4:25 PM, Markus Jelsma <[email protected]> wrote:
> > Hi,
> > 
> > Why is it so hard to get the target URL of a redirect? I have to get the
> > protocolstatus out of the crawl datum's metadata and then get the first
> > arg of ProtocolStatus' args?
> 
> This (I assume) would be the most convenient method for doing so, however
> as you mention, there is the issue of the target URL not being the 1st arg?
> 
> > Can it have more than 1 arg?
> 
> I think we would expect the HTTP protocol status line and the
> Content-length header (which the web server usually adds automatically)
> before the target URL...
> 
> > Is there a decent method to get the URL? At first i assumed _repr_ key
> > would return the target URL but that key doesn't seem to exist for some
> > test redirects i have.
> 
> What is the nature of the redirects and how are they specified? Manual
> redirects, HTTP refresh header, etc?

-- 
Markus Jelsma - CTO - Openindex
