Tien Nguyen Manh created NUTCH-1705:
---------------------------------------
Summary: Make configuration option for HtmlParser & TikaParser to
extract text or title for noIndex page
Key: NUTCH-1705
URL: https://issues.apache.org/jira/browse/NUTCH-1705
Project: Nutch
Issue Type: Improvement
Reporter: Tien Nguyen Manh
Priority: Minor
Currently, HtmlParser and TikaParser always skip extracting text and title for
noIndex pages, i.e. pages that carry a noindex robots meta tag.
But some parse filters may still be interested in the text and title, as in
NUTCH-1661, where we may decide whether to follow a page based on its language.
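
The requested behaviour could look roughly like the sketch below. It is plain
Java, not the actual Nutch parser code: the property name
"parser.noindex.extract.content" and the helper shouldExtract are hypothetical,
and java.util.Properties stands in for the Nutch configuration object. The
default keeps today's behaviour (skip extraction on noIndex pages).

```java
import java.util.Properties;

public class NoIndexExtractionSketch {

    // Hypothetical property name, chosen for illustration only;
    // it is not an existing Nutch setting.
    static final String EXTRACT_KEY = "parser.noindex.extract.content";

    /**
     * Decide whether text and title should be extracted for a page.
     *
     * @param noIndex true if the page carries a noindex robots meta tag
     * @param conf    configuration holding the (hypothetical) option
     */
    static boolean shouldExtract(boolean noIndex, Properties conf) {
        // Default "false" preserves the current behaviour described above.
        boolean extractOnNoIndex =
                Boolean.parseBoolean(conf.getProperty(EXTRACT_KEY, "false"));
        return !noIndex || extractOnNoIndex;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Option unset: a noIndex page is still skipped.
        System.out.println(shouldExtract(true, conf));

        // Option enabled: text and title become available to parse filters.
        conf.setProperty(EXTRACT_KEY, "true");
        System.out.println(shouldExtract(true, conf));
    }
}
```

With a switch like this, a parse filter such as the one in NUTCH-1661 could
still see the text and title, while the page itself remains marked as noIndex.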