Re: Redirect pages in segment

Andrzej Bialecki Mon, 14 Jan 2008 11:42:21 -0800

Tomislav Poljak wrote:

Hi, I have been reading data from Nutch segments and came across
pages/records with empty parse text. So I looked more into this and
manually fetched data for these URLs. Lots of them are redirect pages,
but stored in the Nutch segment as pages (with metadata but empty
parse text). My question is: does Nutch get the target page, the page
that the original page redirects to?

Usually - yes, unless the target URL is prohibited by your URLFilters configuration. Whether the target page is fetched immediately, or during the next round, depends on settings.

Does it get all the information
about it (text, metadata...)? Why does Nutch store these empty/redirect
pages?

We need to track known URLs - today this page redirects somewhere, tomorrow it may not. Sometimes the content of redirecting pages is useful in itself (e.g. stock quotes page updated every 60 seconds). Also, it's not always easy to detect a type of redirection - if the redirect is expected in 300 seconds, do you think the content of the redirecting page is useless? Usually it's not. You could also implement an HtmlParseFilter that vetoes certain redirects, or does the opposite, i.e. detects a redirect specified as inline JavaScript. Etc. Etc.

In short, it makes sense to store these pages, and then decide what to do (ignore, process, ...).
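
To illustrate the "decide later" step, here is a minimal sketch in Python of how a stored page could be classified by its meta refresh delay. It is not Nutch code and not an HtmlParseFilter implementation; the names MetaRefreshFinder and classify, and the 5 second threshold, are assumptions made up for this example.

```python
import re
from html.parser import HTMLParser
from typing import Optional, Tuple

# Matches refresh content such as "0; url=http://example.com/" or just "60".
_REFRESH_RE = re.compile(r"\s*(\d+)\s*(?:;\s*url\s*=\s*(.+?))?\s*$", re.I)


class MetaRefreshFinder(HTMLParser):
    """Records the first <meta http-equiv="refresh"> tag in a page (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        # (delay in seconds, target URL or None for a self-refresh)
        self.refresh: Optional[Tuple[int, Optional[str]]] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        match = _REFRESH_RE.match(attr.get("content", ""))
        if match:
            target = match.group(2)
            if target is not None:
                target = target.strip("'\"")
            self.refresh = (int(match.group(1)), target)


def classify(html: str, max_delay: int = 5) -> str:
    """Return "redirect" for a quick redirect to another URL, else "content".

    max_delay is an assumed threshold in seconds, not a Nutch setting.
    """
    finder = MetaRefreshFinder()
    finder.feed(html)
    if finder.refresh is None:
        return "content"      # no meta refresh at all
    delay, target = finder.refresh
    if target is None:
        return "content"      # page reloads itself, e.g. a stock quotes page
    return "redirect" if delay <= max_delay else "content"
```

With these assumptions, a page carrying content="0; url=..." would be treated as a redirect, while one that refreshes after 300 seconds, or simply reloads itself every 60 seconds, would be kept as content.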


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
