Hi,

This is not about how Nutch behaves with redirects or how its state is implemented, just about why retrieving the redirect target from a CrawlDatum via its Metadata and then via the ProtocolStatus arguments is so tedious.
Cheers,

On Saturday 21 January 2012 14:25:17 Lewis John Mcgibbney wrote:
> Hi Markus,
>
> As I'm sure you are aware there is quite a bit of literature on the rather
> extensive topic area of URL redirects, hopefully by discussing we can try
> to clarifying wtf to your specific question(s).
>
> I think the main problem here is that there are numerous methods for
> implementing redirects, most of which seem to communicate the redirect in
> different ways.
>
> On Thu, Jan 19, 2012 at 4:25 PM, Markus Jelsma
> <[email protected]>wrote:
> > Hi,
> >
> > Why is it so hard to get the target URL of a redirect? I have to get the
> > protocolstatus out of the crawl datum's metadata and then get the first
> > arg of ProtocolStatus' args?
>
> This (I assume) would be the most convenient method for doing so, however
> as you mention, there is the issue of the target URL not being the 1st arg?
>
> > Can it have more than 1 arg?
>
> I think we would expect the HTTP protocol status line and the
> Content-length header (which the web server usually adds automatically)
> before the target URL...
>
> > Is there a decent method to get the URL? At first
> > i assumed _repr_ key would return the target URL but that key doesn't
> > seem to exist for some test redirects i have.
>
> What is the nature of the redirects and how are they specified? Manual
> redirects, HTTP refresh header, etc?

-- 
Markus Jelsma - CTO - Openindex

