[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma resolved NUTCH-2227.
----------------------------------
Resolution: Fixed
Committed to trunk in revision 1731849.
> RegexParseFilter
> ----------------
>
> Key: NUTCH-2227
> URL: https://issues.apache.org/jira/browse/NUTCH-2227
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.11
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2227.patch, NUTCH-2227.patch, NUTCH-2227.patch,
> NUTCH-2227.patch, NUTCH-2227.patch
>
>
> A parse filter that takes a regex and a field name. If the regex matches via
> matcher.find() on the HTML, the field name is set to true in the CrawlDatum's
> metadata.
> Combined with the HostDB, it is easy to get a list of hosts that match some
> regex criteria.
> {code}
> # Example configuration file for parsefilter-regex
> #
> # Parse metadata field <name> is set to true if the HTML matches the regex. The
> # source can either be html or text. If source is html, the regex is applied to
> # the entire HTML tree. If source is text, the regex is applied to the
> # extracted text.
> #
> # format: <name>\t<source>\t<regex>\n
> {code}
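
To illustrate the rule format described above, here is a minimal Java sketch of how one `<name>\t<source>\t<regex>` line could be parsed and applied with `matcher.find()`. It is not the code from the attached patch: the class `RegexRuleSketch`, its methods, the sample rule `has_video` and the plain `Map` standing in for the metadata are all hypothetical. Skipping blank lines and `#` comments, and rejecting sources other than html or text, are assumptions based only on the example configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the class committed for NUTCH-2227.
final class RegexRuleSketch {

    // One rule parsed from a "<name>\t<source>\t<regex>" line.
    static final class Rule {
        final String name;      // metadata field to set
        final String source;    // "html" or "text"
        final Pattern pattern;  // compiled regex

        Rule(String name, String source, Pattern pattern) {
            this.name = name;
            this.source = source;
            this.pattern = pattern;
        }
    }

    // Parses one configuration line. Returns null for blank lines and
    // '#' comments (assumed behaviour, inferred from the example file).
    static Rule parseRule(String line) {
        if (line == null) {
            return null;
        }
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null;
        }
        // Limit of 3 keeps any tab inside the regex part of the regex.
        String[] parts = line.split("\t", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "expected <name>\\t<source>\\t<regex>: " + line);
        }
        String source = parts[1].trim();
        if (!source.equals("html") && !source.equals("text")) {
            throw new IllegalArgumentException("unknown source: " + source);
        }
        return new Rule(parts[0].trim(), source, Pattern.compile(parts[2]));
    }

    // find() succeeds if the regex matches anywhere in the input,
    // unlike matches(), which requires the whole input to match.
    static boolean applies(Rule rule, String html, String text) {
        String input = rule.source.equals("html") ? html : text;
        return rule.pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        Rule rule = parseRule("has_video\thtml\t<video[\\s>]");
        String html = "<html><body><video src=\"a.mp4\"></video>Hello</body></html>";
        String text = "Hello";

        // A plain map stands in for the real metadata container here.
        Map<String, String> metadata = new LinkedHashMap<>();
        if (applies(rule, html, text)) {
            metadata.put(rule.name, "true");
        }
        System.out.println(metadata);
    }
}
```

Because `find()` is used, a rule does not need leading or trailing `.*` to match inside a large page. The same rule with source text would not match in this example, since the extracted text contains no markup.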
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)