[ https://issues.apache.org/jira/browse/NUTCH-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295051#comment-16295051 ]
ASF GitHub Bot commented on NUTCH-2359:
---------------------------------------
sebastian-nagel closed pull request #178: NUTCH-2359 RegexParseFilter: ill-formed rules raise error
URL: https://github.com/apache/nutch/pull/178
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/src/plugin/parsefilter-regex/README.txt b/src/plugin/parsefilter-regex/README.txt
new file mode 100644
index 000000000..9cbfbf170
--- /dev/null
+++ b/src/plugin/parsefilter-regex/README.txt
@@ -0,0 +1,37 @@
+Parsefilter-regex plugin
+
+Allow parsing and set custom defined fields using regex. Rules can be defined in a separate rule file or in the nutch configuration.
+
+If a rule file is used, should create a text file regex-parsefilter.txt (which is the default name of the rules file). To use a different filename, either update the file value in plugin’s build.xml or add parsefilter.regex.file config to the nutch config.
+
+ie:
+ <property>
+ <name>parsefilter.regex.file</name>
+ <value>
+ /path/to/rulefile
+ </value>
+ </property
+
+
+Format of rules: <name>\t<source>\t<regex>\n
+
+ie:
+ my_first_field html h1
+ my_second_field text my_pattern
+
+
+If a rule file is not used, rules can be directly set in the nutch config:
+
+ie:
+ <property>
+ <name>parsefilter.regex.rules</name>
+ <value>
+ my_first_field html h1
+ my_second_field text my_pattern
+ </value>
+ </property
+
+source can be either html or text. If source is html, the regex is applied to
+the entire HTML tree. If source is text, the regex is applied to the
+extracted text.
+
diff --git a/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex/RegexParseFilter.java b/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex/RegexParseFilter.java
index 695516668..f799e5f48 100644
--- a/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex/RegexParseFilter.java
+++ b/src/plugin/parsefilter-regex/src/java/org/apache/nutch/parsefilter/regex/RegexParseFilter.java
@@ -179,13 +179,17 @@ private synchronized void readConfiguration(Reader configReader) throws IOExcept
while ((line = reader.readLine()) != null) {
if (StringUtils.isNotBlank(line) && !line.startsWith("#")) {
line = line.trim();
- String[] parts = line.split("\t");
-
- String field = parts[0].trim();
- String source = parts[1].trim();
- String regex = parts[2].trim();
-
- rules.put(field, new RegexRule(source, regex));
+ String[] parts = line.split("\\s");
+
+ if (parts.length == 3) {
+ String field = parts[0].trim();
+ String source = parts[1].trim();
+ String regex = parts[2].trim();
+
+ rules.put(field, new RegexRule(source, regex));
+ } else {
+ LOG.info("RegexParseFilter rule is invalid. " + line);
+ }
}
}
}
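
For illustration only, the length check introduced by the diff above can be
tried outside Nutch with a small standalone sketch. The class and method names
(RuleLineSketch, parseRule) are hypothetical and not part of the plugin. Note
one deliberate difference from the patch: the sketch splits with
split("\\s+", 3) rather than split("\\s"), on the assumption that runs of
whitespace between fields and whitespace inside the regex should be tolerated.
With split("\\s"), such lines may not yield exactly the three intended parts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical standalone sketch; not the RegexParseFilter implementation.
class RuleLineSketch {

  /**
   * Parses one rule line of the form: name, source, regex.
   * Returns {field, source, regex}, or null if the line is blank,
   * a comment, or does not contain three fields.
   */
  static String[] parseRule(String line) {
    if (line == null) {
      return null;
    }
    String trimmed = line.trim();
    if (trimmed.isEmpty() || trimmed.startsWith("#")) {
      return null;
    }
    // Assumption: split on runs of whitespace and stop after three parts,
    // so the third part (the regex) keeps any inner whitespace.
    String[] parts = trimmed.split("\\s+", 3);
    return parts.length == 3 ? parts : null;
  }

  public static void main(String[] args) {
    String[] lines = {
      "my_first_field\thtml\th1",
      "my_second_field text my_pattern",
      "# a comment line",
      "ill_formed_rule html"
    };
    Map<String, String[]> rules = new LinkedHashMap<>();
    for (String line : lines) {
      String[] rule = parseRule(line);
      if (rule != null) {
        rules.put(rule[0], rule);
      } else if (!line.trim().isEmpty() && !line.trim().startsWith("#")) {
        // Ill-formed rules are reported and skipped instead of throwing.
        System.out.println("Rule is invalid: " + line);
      }
    }
    System.out.println("Accepted rules: " + rules.keySet());
  }
}
```

As in the patch, a line with fewer than three fields is skipped and reported
rather than causing an out-of-bounds array access.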
> Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed
> ----------------------------------------------------------------------------
>
> Key: NUTCH-2359
> URL: https://issues.apache.org/jira/browse/NUTCH-2359
> Project: Nutch
> Issue Type: Bug
> Components: plugin
> Affects Versions: 1.12
> Reporter: Laknath Semage
> Assignee: Markus Jelsma
> Priority: Minor
> Labels: patch
> Fix For: 1.13
>
>
> This patch fixes:
> 1) [Bug] Parsefilter-regex raises IndexOutOfBoundsException when rules are
> ill-formed
> 2) Rules are split on any whitespace character (\s) instead of tab (\t)
> 3) A detailed Readme for the plugin