Tomislav Poljak wrote:
> Andrzej, thanks for the explanation.
> How can I distinguish these redirect pages from the 'normal' ones (with
> content)? Is there some status or flag? (With parseData.getStatus() I get
> success(1,0) for both redirect and normal pages.) Can I use the HTTP
> response code, and if so, how can I get it (I don't see it in the
> parseData metadata)?

Nutch recognizes two different redirection methods. One is at the
protocol level (mostly in the case of HTTP), the other is at the content
level (such as HTML meta tags and JavaScript redirects).
So there are two places where Nutch may record that a redirect has
occurred: one is in the ProtocolStatus (stored in crawl_fetch, and later
added to crawldb during the updatedb operation), the other one is in
ParseStatus.SUCCESS_REDIRECT (stored in parse_data).
In the standard Nutch workflow both pieces of information are available
together only during indexing, so you can implement an IndexingFilter
that checks them and treats redirect pages differently from normal ones.
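
To illustrate the two signals described above, here is a minimal
stand-alone Java sketch. It deliberately does not use the Nutch classes:
the RedirectSignals class, the Kind enum and the classify method are
hypothetical names, and the numeric codes are assumed values that should
be checked against the ProtocolStatus and ParseStatus constants in your
Nutch version. It only shows the decision logic an IndexingFilter might
apply once it has both statuses in hand.

```java
/** Stand-alone sketch; nothing here is part of the Nutch API. */
final class RedirectSignals {

    // Assumed codes, named after the Nutch constants they stand in for.
    // Verify the real values in ProtocolStatus and ParseStatus.
    static final int PROTOCOL_SUCCESS = 1;
    static final int PROTOCOL_MOVED = 12;       // permanent redirect (assumed value)
    static final int PROTOCOL_TEMP_MOVED = 13;  // temporary redirect (assumed value)
    static final int PARSE_SUCCESS = 1;         // major code
    static final int PARSE_MINOR_OK = 0;        // minor code for a normal page
    static final int PARSE_MINOR_REDIRECT = 100; // stands in for SUCCESS_REDIRECT (assumed value)

    enum Kind { NORMAL, PROTOCOL_REDIRECT, CONTENT_REDIRECT }

    /**
     * Classifies a page from the two places a redirect may be recorded:
     * the protocol status code and the (major, minor) parse status.
     */
    static Kind classify(int protocolCode, int parseMajor, int parseMinor) {
        // Protocol-level redirect, e.g. an HTTP 3xx response.
        if (protocolCode == PROTOCOL_MOVED || protocolCode == PROTOCOL_TEMP_MOVED) {
            return Kind.PROTOCOL_REDIRECT;
        }
        // Content-level redirect, e.g. an HTML meta refresh found by the parser.
        if (parseMajor == PARSE_SUCCESS && parseMinor == PARSE_MINOR_REDIRECT) {
            return Kind.CONTENT_REDIRECT;
        }
        return Kind.NORMAL;
    }

    public static void main(String[] args) {
        // A page fetched normally and parsed as success(1,0).
        System.out.println(classify(PROTOCOL_SUCCESS, PARSE_SUCCESS, PARSE_MINOR_OK));
        // A page whose fetch reported a permanent move.
        System.out.println(classify(PROTOCOL_MOVED, PARSE_SUCCESS, PARSE_MINOR_OK));
        // A page fetched normally whose content redirects elsewhere.
        System.out.println(classify(PROTOCOL_SUCCESS, PARSE_SUCCESS, PARSE_MINOR_REDIRECT));
    }
}
```

The protocol check comes first in this sketch because a parse status of
success(1,0) on its own says nothing about a protocol-level redirect,
which matches what was observed with parseData.getStatus().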
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com