[ https://issues.apache.org/jira/browse/NUTCH-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158916#comment-15158916 ]
Hudson commented on NUTCH-2227:
-------------------------------
SUCCESS: Integrated in Nutch-trunk #3352 (See
[https://builds.apache.org/job/Nutch-trunk/3352/])
NUTCH-2227 RegexParseFilter (markus:
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1731849])
* trunk/CHANGES.txt
* trunk/build.xml
* trunk/conf/regex-parsefilter.txt
* trunk/default.properties
* trunk/src/plugin/build.xml
* trunk/src/plugin/parsefilter-regex
* trunk/src/plugin/parsefilter-regex/build.xml
* trunk/src/plugin/parsefilter-regex/data
* trunk/src/plugin/parsefilter-regex/data/regex-parsefilter.txt
* trunk/src/plugin/parsefilter-regex/ivy.xml
* trunk/src/plugin/parsefilter-regex/plugin.xml
* trunk/src/plugin/parsefilter-regex/src
* trunk/src/plugin/parsefilter-regex/src/java
* trunk/src/plugin/parsefilter-regex/src/java/org
* trunk/src/plugin/parsefilter-regex/src/java/org/apache
* trunk/src/plugin/parsefilter-regex/src/java/org/apache/nutch
* trunk/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter
* trunk/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex
* trunk/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex/RegexParseFilter.java
* trunk/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex/package-info.java
* trunk/src/plugin/parsefilter-regex/src/test
* trunk/src/plugin/parsefilter-regex/src/test/org
* trunk/src/plugin/parsefilter-regex/src/test/org/apache
* trunk/src/plugin/parsefilter-regex/src/test/org/apache/nutch
* trunk/src/plugin/parsefilter-regex/src/test/org/apache/nutch/parsefilter
* trunk/src/plugin/parsefilter-regex/src/test/org/apache/nutch/parsefilter/regex
* trunk/src/plugin/parsefilter-regex/src/test/org/apache/nutch/parsefilter/regex/TestRegexParseFilter.java
> RegexParseFilter
> ----------------
>
> Key: NUTCH-2227
> URL: https://issues.apache.org/jira/browse/NUTCH-2227
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.11
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2227.patch, NUTCH-2227.patch, NUTCH-2227.patch,
> NUTCH-2227.patch, NUTCH-2227.patch
>
>
> A parse filter that takes a regex and a field name. If the regex matches the
> HTML via matcher.find(), the field name is set to true in the CrawlDatum's
> metadata.
> Combined with the HostDB, it is easy to get a list of hosts that match some
> regex criteria.
> {code}
> # Example configuration file for parsefilter-regex
> #
> # Parse metadata field <name> is set to true if the HTML matches the regex. The
> # source can either be html or text. If source is html, the regex is applied to
> # the entire HTML tree. If source is text, the regex is applied to the
> # extracted text.
> #
> # format: <name>\t<source>\t<regex>\n
> {code}
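The rule format and matching behaviour described above can be sketched in plain Java. This is only an illustration, not the committed RegexParseFilter: the class name RegexRuleSketch, the Rule holder, the sample rules, and the plain Map standing in for the metadata are all hypothetical. Skipping malformed lines and rules with an unknown source is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class RegexRuleSketch {

  /** One parsed rule line: <name>\t<source>\t<regex> (hypothetical holder class). */
  public static final class Rule {
    final String name;
    final String source; // "html" or "text"
    final Pattern pattern;

    Rule(String name, String source, Pattern pattern) {
      this.name = name;
      this.source = source;
      this.pattern = pattern;
    }
  }

  /** Parses rule lines, skipping blank lines and lines starting with '#'. */
  public static List<Rule> parseRules(String config) {
    List<Rule> rules = new ArrayList<>();
    for (String line : config.split("\n")) {
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue;
      }
      String[] parts = line.split("\t", 3);
      if (parts.length != 3) {
        continue; // assumption: malformed lines are silently ignored
      }
      rules.add(new Rule(parts[0].trim(), parts[1].trim(), Pattern.compile(parts[2])));
    }
    return rules;
  }

  /** Sets <name> to "true" for every rule whose regex is found in its source. */
  public static Map<String, String> apply(List<Rule> rules, String html, String text) {
    Map<String, String> metadata = new LinkedHashMap<>(); // stand-in for real metadata
    for (Rule rule : rules) {
      String input;
      if ("html".equals(rule.source)) {
        input = html;
      } else if ("text".equals(rule.source)) {
        input = text;
      } else {
        continue; // assumption: unknown sources are skipped
      }
      // find() succeeds on any matching substring; matches() would need the whole input
      if (rule.pattern.matcher(input).find()) {
        metadata.put(rule.name, "true");
      }
    }
    return metadata;
  }

  public static void main(String[] args) {
    // Hypothetical rules in the <name>\t<source>\t<regex> format
    String config = "# sample rules\n"
        + "has_form\thtml\t<form[^>]*>\n"
        + "mentions_nutch\ttext\t(?i)nutch\n";
    String html = "<html><body><form action=\"/q\"></form>Apache Nutch</body></html>";
    String text = "Apache Nutch";
    System.out.println(apply(parseRules(config), html, text));
  }
}
```

Because find() is used rather than matches(), a rule does not need leading or trailing `.*` to match somewhere inside a large document.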
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)