GitHub user chrismattmann commented on a diff in the pull request:
https://github.com/apache/nutch/pull/34#discussion_r32870336
--- Diff: src/java/org/apache/nutch/parse/ParseSegment.java ---
@@ -69,6 +77,35 @@ public void configure(JobConf job) {
setConf(job);
this.scfilters = new ScoringFilters(job);
skipTruncated = job.getBoolean(SKIP_TRUNCATED, true);
+
+ filterflag = job.getBoolean(PARSER_MODELFILTER, true);
+ if (filterflag) {
+ String[] args = new String[2];
+ args[0] = getConf().get(TRAINFILE_MODELFILTER);
+ args[1] = getConf().get(DICTFILE_MODELFILTER);
+
+ if (args[0] == null || args[0].trim().length() == 0 || args[1] == null
+ || args[1].trim().length() == 0) {
+ String message = "Model URLFilter: trainfile or wordlist not set in the urlfilter.model.trainfile or urlfilter.model.wordlist";
+ if (LOG.isErrorEnabled()) {
+ filterflag = false;
+ LOG.error(message);
+ }
+ throw new IllegalArgumentException(message);
+ } else {
+ try {
+ filters = new URLFilters(job);
+ filter = (ModelURLFilterAbstract) filters
--- End diff --
This ties the core Nutch classes to one specific filter, the ModelURLFilter.
Why can't the URL filter be kept entirely inside the plugin? This change
shouldn't have to touch the Nutch core.
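
To make the concern concrete, here is a minimal sketch of the separation being asked for. Every name in it (`UrlFilter`, `ModelUrlFilter`, `FilterChain`, the `blockedWord` parameter) is a hypothetical stand-in invented for illustration, not the actual Nutch API or the code in this PR. The substring check is only a toy placeholder for whatever the trained model would decide. The point is that the core-side chain sees only the generic interface, while the config validation from the diff lives in the plugin-side class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generic URL filter extension point:
// returns the URL if it is accepted, or null if it is rejected.
interface UrlFilter {
    String filter(String url);
}

// Hypothetical plugin-side filter. The trainfile/wordlist checks that the
// diff adds to ParseSegment would live here instead of in a core class.
class ModelUrlFilter implements UrlFilter {
    private final String blockedWord; // toy placeholder for a trained model

    ModelUrlFilter(String trainFile, String wordList, String blockedWord) {
        if (trainFile == null || trainFile.trim().isEmpty()
                || wordList == null || wordList.trim().isEmpty()) {
            throw new IllegalArgumentException("trainfile or wordlist not set");
        }
        this.blockedWord = blockedWord;
    }

    @Override
    public String filter(String url) {
        // Assumption: a substring match stands in for the model's decision.
        return url.contains(blockedWord) ? null : url;
    }
}

// Hypothetical core-side chain. It depends only on the UrlFilter interface
// and never casts to, or names, a concrete filter class.
class FilterChain {
    private final List<UrlFilter> filters = new ArrayList<>();

    FilterChain add(UrlFilter filter) {
        filters.add(filter);
        return this;
    }

    String filter(String url) {
        for (UrlFilter f : filters) {
            if (url == null) {
                break; // an earlier filter already rejected the URL
            }
            url = f.filter(url);
        }
        return url;
    }
}
```

With this shape, the core code would call only `chain.filter(url)`, and enabling or disabling the model filter becomes a matter of which filters are registered rather than a flag and a cast in a core class.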