[ 
https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2227:
---------------------------------
    Description: 
A parse filter that takes a regex and a field name. If the regex matches the 
HTML via matcher.find(), the field name is set to true in the CrawlDatum's 
metadata.

Combined with the HostDB, it is easy to get a list of hosts that match some 
regex criteria.

{code}
# Example configuration file for parsefilter-regex
#
# Parse metadata field <name> is set to true if the HTML matches the regex. The
# source can either be html or text. If source is html, the regex is applied to
# the entire HTML tree. If source is text, the regex is applied to the
# extracted text.
#
# format: <name>\t<source>\t<regex>\n
{code}
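
The rule format above can be illustrated with a minimal stand-alone sketch. It is not the plugin's actual code and uses no Nutch APIs: the class name RegexRuleSketch, its methods, and the plain Map standing in for the metadata are all hypothetical. It also assumes that '#' lines are comments and that a field is simply left unset when its regex does not match, which the description does not state.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not the actual parsefilter-regex plugin code.
final class RegexRuleSketch {

  /** One rule from the configuration: <name>\t<source>\t<regex>. */
  static final class Rule {
    final String name;
    final String source; // either "html" or "text"
    final Pattern pattern;

    Rule(String name, String source, Pattern pattern) {
      this.name = name;
      this.source = source;
      this.pattern = pattern;
    }
  }

  /** Parses rule lines, skipping blank lines and '#' comment lines (assumed). */
  static List<Rule> parseRules(String config) {
    List<Rule> rules = new ArrayList<>();
    for (String line : config.split("\n")) {
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue;
      }
      // Limit of 3 keeps any tab characters inside the regex itself intact.
      String[] parts = line.split("\t", 3);
      if (parts.length != 3) {
        throw new IllegalArgumentException("Malformed rule: " + line);
      }
      String source = parts[1].trim();
      if (!source.equals("html") && !source.equals("text")) {
        throw new IllegalArgumentException("Unknown source: " + source);
      }
      rules.add(new Rule(parts[0].trim(), source, Pattern.compile(parts[2])));
    }
    return rules;
  }

  /** Sets <name> to "true" for every rule whose regex is found in its source. */
  static Map<String, String> apply(List<Rule> rules, String html, String text) {
    Map<String, String> metadata = new LinkedHashMap<>();
    for (Rule rule : rules) {
      String input = rule.source.equals("html") ? html : text;
      // find() matches anywhere in the input, unlike matches().
      if (rule.pattern.matcher(input).find()) {
        metadata.put(rule.name, "true");
      }
    }
    return metadata;
  }
}
```

In this sketch, a rule line such as hasForm<TAB>html<TAB><form[^>]*> would set hasForm to true for any page whose HTML contains a form tag; the field name hasForm is only an example.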

  was:
A parse filter that takes a regex and a field name. If regex matches via 
matcher.find() on the HTML. The field name is set to true in the CrawlDatum's 
metadata.

Combined with the HostDB, it is easy to get a list of hosts that match some 
regex criteria.


> RegexParseFilter
> ----------------
>
>                 Key: NUTCH-2227
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2227
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>
> A parse filter that takes a regex and a field name. If the regex matches the 
> HTML via matcher.find(), the field name is set to true in the CrawlDatum's 
> metadata.
> Combined with the HostDB, it is easy to get a list of hosts that match some 
> regex criteria.
> {code}
> # Example configuration file for parsefilter-regex
> #
> # Parse metadata field <name> is set to true if the HTML matches the regex. The
> # source can either be html or text. If source is html, the regex is applied to
> # the entire HTML tree. If source is text, the regex is applied to the
> # extracted text.
> #
> # format: <name>\t<source>\t<regex>\n
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
