Markus Jelsma created NUTCH-2227:
------------------------------------
Summary: RegexParseFilter
Key: NUTCH-2227
URL: https://issues.apache.org/jira/browse/NUTCH-2227
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.11
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.12
A parse filter that takes a regex and a field name. If the regex matches
the HTML via matcher.find(), the field name is set to true in the
CrawlDatum's metadata.
Combined with the HostDB, it is easy to get a list of hosts that match some
regex criteria.
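
The matching rule described above can be sketched roughly as follows. This
is only an illustration, not the plugin's code: the class name
RegexFieldSketch, the method apply(), and the plain Map standing in for the
CrawlDatum's metadata are all assumptions, and no Nutch classes are used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the actual Nutch plugin class.
final class RegexFieldSketch {

    // Puts field -> "true" into the metadata map when the regex is found
    // anywhere in the HTML. Returns whether the regex was found.
    // The Map is an assumed substitute for the CrawlDatum's metadata.
    static boolean apply(String html, Pattern regex, String field,
                         Map<String, String> metadata) {
        // find() succeeds on any matching subsequence, whereas matches()
        // would require the whole document to match the regex.
        boolean found = regex.matcher(html).find();
        if (found) {
            metadata.put(field, "true");
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        String html = "<html><body><form id=\"login\"></form></body></html>";
        // "hasLogin" is a made-up field name for this example.
        apply(html, Pattern.compile("id=\"login\""), "hasLogin", metadata);
        System.out.println(metadata);
    }
}
```

In this sketch a page that does not match leaves the map untouched, so the
field is simply absent rather than set to false. Whether the real filter
behaves the same way is not stated in this issue.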