[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158916#comment-15158916 ]

Hudson commented on NUTCH-2227:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3352 (See [https://builds.apache.org/job/Nutch-trunk/3352/])
NUTCH-2227 RegexParseFilter (markus: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1731849])
* trunk/CHANGES.txt
* trunk/build.xml
* trunk/conf/regex-parsefilter.txt
* trunk/default.properties
* trunk/src/plugin/build.xml
* trunk/src/plugin/parsefilter-regex
* trunk/src/plugin/parsefilter-regex/build.xml
* trunk/src/plugin/parsefilter-regex/data
* trunk/src/plugin/parsefilter-regex/data/regex-parsefilter.txt
* trunk/src/plugin/parsefilter-regex/ivy.xml
* trunk/src/plugin/parsefilter-regex/plugin.xml
* trunk/src/plugin/parsefilter-regex/src
* trunk/src/plugin/parsefilter-regex/src/java
* trunk/src/plugin/parsefilter-regex/src/java/org
* trunk/src/plugin/parsefilter-regex/src/java/org/apache
* trunk/src/plugin/parsefilter-regex/src/java/org/apache/nutch
* trunk/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter
* trunk/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex
* trunk/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex/RegexParseFilter.java
* trunk/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex/package-info.java
* trunk/src/plugin/parsefilter-regex/src/test
* trunk/src/plugin/parsefilter-regex/src/test/org
* trunk/src/plugin/parsefilter-regex/src/test/org/apache
* trunk/src/plugin/parsefilter-regex/src/test/org/apache/nutch
* trunk/src/plugin/parsefilter-regex/src/test/org/apache/nutch/parsefilter
* trunk/src/plugin/parsefilter-regex/src/test/org/apache/nutch/parsefilter/regex
* trunk/src/plugin/parsefilter-regex/src/test/org/apache/nutch/parsefilter/regex/TestRegexParseFilter.java


> RegexParseFilter
> ----------------
>
>                 Key: NUTCH-2227
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2227
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2227.patch, NUTCH-2227.patch, NUTCH-2227.patch, 
> NUTCH-2227.patch, NUTCH-2227.patch
>
>
> A parse filter that takes a regex and a field name. If the regex matches the 
> HTML via matcher.find(), the field name is set to true in the CrawlDatum's 
> metadata.
> Combined with the HostDB, it is easy to get a list of hosts that match some 
> regex criteria.
> {code}
> # Example configuration file for parsefilter-regex
> #
> # Parse metadata field <name> is set to true if the HTML matches the regex.
> # The source can either be html or text. If source is html, the regex is
> # applied to the entire HTML tree. If source is text, the regex is applied to
> # the extracted text.
> #
> # format: <name>\t<source>\t<regex>\n
> {code}
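
The rule format and matching behaviour described above can be sketched roughly as follows. This is only an illustration, not the RegexParseFilter class from the commit: the class name RegexRuleSketch, its Rule type, and the parseRules/apply methods are hypothetical, and a plain Map stands in for the Nutch metadata object. It also assumes that malformed lines and rules with an unknown source are skipped, and that a field is simply left unset when its regex does not match; the description does not specify either case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch; not the plugin's actual implementation.
class RegexRuleSketch {

  /** One configuration line of the form <name>\t<source>\t<regex>. */
  static final class Rule {
    final String name;     // metadata field to set
    final String source;   // "html" or "text"
    final Pattern pattern; // compiled regex

    Rule(String name, String source, Pattern pattern) {
      this.name = name;
      this.source = source;
      this.pattern = pattern;
    }
  }

  /** Parses the rule file contents, ignoring comments and blank lines. */
  static List<Rule> parseRules(String config) {
    List<Rule> rules = new ArrayList<>();
    for (String line : config.split("\n")) {
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue;
      }
      // Limit of 3 so that tabs inside the regex itself are preserved.
      String[] parts = line.split("\t", 3);
      if (parts.length != 3) {
        continue; // assumption: malformed lines are skipped
      }
      rules.add(new Rule(parts[0].trim(), parts[1].trim(),
          Pattern.compile(parts[2])));
    }
    return rules;
  }

  /** Sets <name> to "true" for every rule whose regex is found in its source. */
  static Map<String, String> apply(List<Rule> rules, String html, String text) {
    Map<String, String> metadata = new LinkedHashMap<>();
    for (Rule rule : rules) {
      String input;
      if ("html".equals(rule.source)) {
        input = html;
      } else if ("text".equals(rule.source)) {
        input = text;
      } else {
        continue; // assumption: unknown sources are skipped
      }
      // find() matches anywhere in the input, unlike matches().
      if (rule.pattern.matcher(input).find()) {
        metadata.put(rule.name, "true");
      }
    }
    return metadata;
  }
}
```

With a made-up rule line such as has_form, a tab, html, a tab, and the regex <form[\s>], a page containing a form element would end up with has_form set to true, while a page without one would leave the field unset in this sketch.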



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
