Tomislav Poljak wrote:
Hi, I have been reading data from Nutch segments and came across pages/records with empty parse text. I looked into this further and manually fetched the data for these URLs. Lots of them are redirect pages,
but stored in the Nutch segment as pages (with metadata but empty
parse text). My question is: does Nutch get the target page, the page
that the original page redirects to?

Usually yes, unless the target URL is prohibited by your URLFilters configuration. Whether the target page is fetched immediately or during the next round depends on your fetcher settings (see the http.redirect.max property).


Does it get all the information
about it (text, metadata...)? Why does Nutch store these empty/redirect
pages?

We need to track known URLs: today this page redirects somewhere, tomorrow it may not. Sometimes the content of a redirecting page is useful in itself (e.g. a stock quotes page that refreshes every 60 seconds). Also, it's not always easy to tell what kind of redirect you are looking at: if the redirect is due to fire in 300 seconds, do you think the content of the redirecting page is useless? Usually it's not. You could also implement an HtmlParseFilter that vetoes certain redirects, or does the opposite, i.e. detects a redirect specified as inline JavaScript, and so on.
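
To make the "what kind of redirect is this" question concrete, here is a minimal standalone sketch in Python. It is not Nutch code and does not use the HtmlParseFilter API. It only illustrates the sort of decision such a filter might make for meta-refresh redirects. The names `MetaRefresh`, `parse_refresh_content` and `classify`, the returned labels, and the 3-second threshold are all assumptions made up for this example.

```python
import re
from html.parser import HTMLParser
from typing import NamedTuple, Optional


class MetaRefresh(NamedTuple):
    delay_seconds: float
    target_url: Optional[str]  # None means the page just reloads itself


def parse_refresh_content(content: str) -> Optional[MetaRefresh]:
    """Parse a refresh value such as '300; url=http://example.com/'."""
    delay_part, _, rest = content.partition(";")
    delay_part = delay_part.strip()
    if not re.fullmatch(r"\d+(?:\.\d+)?", delay_part):
        return None  # not a usable delay, treat as "no refresh"
    rest = rest.strip()
    # Drop an optional leading "url=" (case-insensitive).
    prefix = re.match(r"url\s*=\s*", rest, re.IGNORECASE)
    if prefix:
        rest = rest[prefix.end():]
    target = rest.strip().strip("'\"")
    return MetaRefresh(float(delay_part), target or None)


class _RefreshFinder(HTMLParser):
    """Remembers the first parseable <meta http-equiv="refresh"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.refresh: Optional[MetaRefresh] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() == "refresh":
            self.refresh = parse_refresh_content(attr.get("content", ""))


def classify(html: str, immediate_threshold: float = 3.0) -> str:
    """Return a hypothetical label describing the page's meta refresh.

    immediate_threshold is an arbitrary cut-off chosen for illustration:
    redirects that fire sooner are assumed to leave no useful content.
    """
    finder = _RefreshFinder()
    finder.feed(html)
    refresh = finder.refresh
    if refresh is None:
        return "no-redirect"
    if refresh.target_url is None:
        return "self-refresh"        # e.g. a quotes page reloading itself
    if refresh.delay_seconds <= immediate_threshold:
        return "immediate-redirect"  # content is probably a stub
    return "delayed-redirect"        # content may well be worth keeping
```

With these assumptions, a page carrying `content="300; url=..."` is labelled "delayed-redirect", so its own text would be kept, while `content="0; url=..."` is labelled "immediate-redirect". Redirects written as inline JavaScript are not handled by this sketch.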

In short, it makes sense to store these pages, and then decide what to do (ignore, process, ...).

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
