Re: Redirect pages in segment

Andrzej Bialecki Tue, 15 Jan 2008 07:30:36 -0800

Tomislav Poljak wrote:

Andrzej, thanks for the explanation.


How can I distinguish these redirect pages from the 'normal' ones (with
content)? Is there some status or flag? With parseData.getStatus() I get
success(1,0) for both redirect and normal pages. Can I use the HTTP
response code, and if so, how can I get it (I don't see it in the parseData
metadata)?

Nutch recognizes two different redirection methods. One is at the protocol level (mostly in case of HTTP), the other is at the content level (such as HTML meta tags, and Javascript redirects).

So there are two places where Nutch may record that a redirect has occurred: one is in the ProtocolStatus (stored in crawl_fetch, and later added to crawldb during updatedb operation), the other one is in ParseStatus.SUCCESS_REDIRECT (stored in parse_data).

In the standard Nutch workflow both pieces of information are available together only during indexing, so you can implement an IndexingFilter that checks them and, for example, skips or flags redirect pages.
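
To make the two checks concrete, here is a minimal, self-contained sketch of the decision such a filter would make. It deliberately does not use the Nutch classes: FetchStatus, PARSE_SUCCESS and PARSE_SUCCESS_REDIRECT are hypothetical stand-ins, and the numeric values are my assumption of what ParseStatus.SUCCESS and ParseStatus.SUCCESS_REDIRECT hold, so verify them against your Nutch version. In a real IndexingFilter you would read the parse status from parseData.getStatus() and the fetch status from the CrawlDatum passed to the filter.

```java
public class RedirectCheck {
    // Assumed to mirror ParseStatus.SUCCESS (major code) and
    // ParseStatus.SUCCESS_REDIRECT (minor code); check your Nutch version.
    public static final int PARSE_SUCCESS = 1;
    public static final int PARSE_SUCCESS_REDIRECT = 100;

    // Hypothetical stand-in for the protocol-level status that ends up
    // in crawl_fetch and, after updatedb, in crawldb.
    public enum FetchStatus { FETCHED, REDIR_TEMP, REDIR_PERM, GONE }

    // Protocol-level redirect, e.g. an HTTP 3xx response.
    public static boolean isProtocolRedirect(FetchStatus status) {
        return status == FetchStatus.REDIR_TEMP || status == FetchStatus.REDIR_PERM;
    }

    // Content-level redirect, e.g. an HTML meta refresh found while parsing.
    public static boolean isContentRedirect(int major, int minor) {
        return major == PARSE_SUCCESS && minor == PARSE_SUCCESS_REDIRECT;
    }

    // A page counts as a redirect if either level reported one.
    public static boolean isRedirect(FetchStatus status, int major, int minor) {
        return isProtocolRedirect(status) || isContentRedirect(major, minor);
    }

    public static void main(String[] args) {
        // A normal page: fetched fine, parse status success(1,0).
        System.out.println(isRedirect(FetchStatus.FETCHED, 1, 0));
        // A meta-refresh page: fetched fine, parse status success(1,100).
        System.out.println(isRedirect(FetchStatus.FETCHED, 1, 100));
        // An HTTP-level redirect.
        System.out.println(isRedirect(FetchStatus.REDIR_PERM, 1, 0));
    }
}
```

An IndexingFilter built on this idea could return the document unchanged for normal pages and drop or mark it when the combined check is true.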


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
