Andrzej, thanks for the explanation. How can I distinguish these redirect pages from the 'normal' ones (with content)? Is there some status or flag I can check? With parseData.getStatus() I get success(1,0) for both redirect and normal pages. Can I use the HTTP response code, and if so, how can I get it? I don't see it in the parseData meta data.
Tomislav

On Mon, 2008-01-14 at 20:41 +0100, Andrzej Bialecki wrote:
> Tomislav Poljak wrote:
> > Hi,
> > I have been reading data from Nutch segments and came across
> > pages/records with empty parse text. So I looked more into this and
> > manually fetched data for these URLs. Lots of them are redirect pages,
> > but stored in the Nutch segment as pages (with meta data but empty
> > parse text). My question is: does Nutch get the target page, the page
> > that the original page redirects to?
>
> Usually - yes, unless the target url is prohibited by your URLFilters
> configuration. Whether the target page is fetched immediately, or during
> the next round, depends on settings.
>
> > Does it get all the information
> > about it (text, meta data...)? Why does Nutch store these
> > empty/redirect pages?
>
> We need to track known URLs - today this page redirects somewhere,
> tomorrow it may not. Sometimes the content of redirecting pages is
> useful in itself (e.g. stock quotes page updated every 60 seconds).
> Also, it's not always easy to detect a type of redirection - if the
> redirect is expected in 300 seconds, do you think the content of the
> redirecting page is useless? Usually it's not. You could also implement
> an HtmlParseFilter that vetoes certain redirects, or does the opposite
> i.e. detects a redirect specified as an inline Javascript. Etc. Etc.
>
> In short, it makes sense to store these pages, and then decide what to
> do (ignore, process, ...).
