Andrzej, thanks for the explanation.

How can I distinguish these redirect pages from the 'normal' ones (with
content)? Is there some status or flag? With parseData.getStatus() I get
success(1,0) for both redirect and normal pages. Can I use the HTTP
response code, and if so, how can I get it? I don't see it in the
parseData metadata.

Tomislav


On Mon, 2008-01-14 at 20:41 +0100, Andrzej Bialecki wrote:
> Tomislav Poljak wrote:
> > Hi, I have been reading data from Nutch segments and came across
> > pages/records with empty parse text. So I looked into this and
> > manually fetched data for these URLs. Lots of them are redirect pages,
> > but they are stored in the Nutch segment as pages (with metadata but
> > empty parse text). My question is: does Nutch get the target page, the
> > page that the original page redirects to?
> 
> Usually - yes, unless the target url is prohibited by your URLFilters 
> configuration. Whether the target page is fetched immediately, or during 
> the next round, depends on settings.
> 
> 
> > Does it get all the information
> > about it (text, metadata...)? Why does Nutch store these empty/redirect
> > pages?
> 
> We need to track known URLs - today this page redirects somewhere, 
> tomorrow it may not. Sometimes the content of redirecting pages is 
> useful in itself (e.g. stock quotes page updated every 60 seconds). 
> Also, it's not always easy to detect a type of redirection - if the 
> redirect is expected in 300 seconds, do you think the content of the 
> redirecting page is useless? Usually it's not. You could also implement 
> an HtmlParseFilter that vetoes certain redirects, or does the opposite 
> i.e. detects a redirect specified as inline JavaScript, and so on.
> 
> In short, it makes sense to store these pages, and then decide what to 
> do (ignore, process, ...).
> 
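To make the point about redirect delays concrete, here is a minimal sketch of how one might classify the "content" value of a meta-refresh tag (for example "300; url=http://example.com/quotes"). It is plain Java and does not use the Nutch HtmlParseFilter API. The class name, the method names, and the idea of a delay threshold are all hypothetical, chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: inspects a meta-refresh "content" value. */
final class MetaRefresh {
    // "<seconds>", optionally followed by ";" or "," and "url=<target>" (quotes optional).
    private static final Pattern CONTENT = Pattern.compile(
        "^\\s*(\\d+)\\s*(?:[;,]\\s*(?:url\\s*=\\s*)?['\"]?(.*?)['\"]?\\s*)?$",
        Pattern.CASE_INSENSITIVE);

    // Returns a successful matcher, or null if the value is null or malformed.
    private static Matcher match(String content) {
        if (content == null) return null;
        Matcher m = CONTENT.matcher(content);
        return m.matches() ? m : null;
    }

    /** Delay in seconds, or -1 if the value cannot be parsed. */
    static int delaySeconds(String content) {
        Matcher m = match(content);
        if (m == null) return -1;
        try {
            return Integer.parseInt(m.group(1));
        } catch (NumberFormatException e) {
            return -1; // digits only, but too large for an int
        }
    }

    /** Target URL, or null if there is none (a plain self-refresh) or the value is malformed. */
    static String targetUrl(String content) {
        Matcher m = match(content);
        if (m == null) return null;
        String url = m.group(2);
        return (url == null || url.isEmpty()) ? null : url;
    }

    /** True if the page redirects elsewhere within maxDelaySeconds (an assumed threshold). */
    static boolean isImmediateRedirect(String content, int maxDelaySeconds) {
        int delay = delaySeconds(content);
        return delay >= 0 && delay <= maxDelaySeconds && targetUrl(content) != null;
    }
}
```

With a threshold of, say, 5 seconds, "0; url=http://example.com/new" would count as an immediate redirect whose own content can probably be ignored, while "300; url=http://example.com/quotes" would not, so the content of that page would be kept.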
