[
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170556#comment-15170556
]
ASF GitHub Bot commented on NUTCH-961:
--------------------------------------
Github user lewismc commented on a diff in the pull request:
https://github.com/apache/nutch/pull/92#discussion_r54332193
--- Diff:
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java
---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parse.tika;
+
+import java.lang.ClassLoader;
+import java.lang.InstantiationException;
+import java.util.WeakHashMap;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.tika.parser.html.BoilerpipeContentHandler;
+import de.l3s.boilerpipe.BoilerpipeExtractor;
+import de.l3s.boilerpipe.extractors.*;
+
+class BoilerpipeExtractorRepository {
+
+ public static final Log LOG = LogFactory.getLog(BoilerpipeExtractorRepository.class);
+ public static final WeakHashMap<String, BoilerpipeExtractor> extractorRepository = new WeakHashMap<String, BoilerpipeExtractor>();
+
+ /**
+ * Returns an instance of the specified extractor
+ */
+ public static BoilerpipeExtractor getExtractor(String boilerpipeExtractorName) {
+ // Check if there's no instance of this extractor
+ if (!extractorRepository.containsKey(boilerpipeExtractorName)) {
+ // FQCN
+ boilerpipeExtractorName = "de.l3s.boilerpipe.extractors." + boilerpipeExtractorName;
+
+ // Attempt to load the class
+ try {
+ ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
+ Class extractorClass = loader.loadClass(boilerpipeExtractorName);
+
+ // Add an instance to the repository
+ extractorRepository.put(boilerpipeExtractorName, (BoilerpipeExtractor)extractorClass.newInstance());
+
+ } catch (ClassNotFoundException e) {
+ LOG.error("BoilerpipeExtractor " + boilerpipeExtractorName + " not found!");
--- End diff --
With SLF4J we can structure the logging call in this catch block better by using a parameterized message instead of string concatenation:
http://www.slf4j.org/faq.html#logging_performance
e.g.
```
logger.debug("The entry is {}.", entry);
```
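Applied to the catch block in the diff, the same idea might look roughly like the sketch below. It is only an illustration: it uses java.util.logging (with `{0}` placeholders) as a stand-in so that it runs without extra dependencies, and the class name `ParameterizedLoggingSketch`, the helper `loadOrNull` and the `CapturingHandler` are hypothetical and not part of the pull request. With SLF4J the equivalent call would be `LOG.error("BoilerpipeExtractor {} not found!", boilerpipeExtractorName)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical sketch, not the code from the pull request.
class ParameterizedLoggingSketch {

  private static final Logger LOG =
      Logger.getLogger(ParameterizedLoggingSketch.class.getName());

  /**
   * Tries to load the named class and returns null if it cannot be found.
   * The log call passes the template and the argument separately, so the
   * final message is only built if a handler actually formats the record.
   */
  public static Class<?> loadOrNull(String className) {
    try {
      return Class.forName(className);
    } catch (ClassNotFoundException e) {
      // No string concatenation here: {0} is filled in by the formatter.
      LOG.log(Level.SEVERE, "BoilerpipeExtractor {0} not found!", className);
      return null;
    }
  }

  /** Handler that keeps the raw records so the template can be inspected. */
  public static final class CapturingHandler extends Handler {
    public final List<LogRecord> records = new ArrayList<>();

    @Override
    public void publish(LogRecord record) {
      records.add(record);
    }

    @Override
    public void flush() {
      // Nothing is buffered.
    }

    @Override
    public void close() {
      // No resources to release.
    }
  }

  public static void main(String[] args) {
    // This class name is assumed not to exist; the lookup logs at SEVERE.
    loadOrNull("de.l3s.boilerpipe.extractors.NoSuchExtractor");
  }
}
```

The log record carries the unformatted template plus the class name as a parameter, which is the property the SLF4J FAQ entry relies on to avoid building messages that are never written.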
> Expose Tika's boilerpipe support
> --------------------------------
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.11
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: BoilerpipeExtractorRepository.java,
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch,
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch,
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch,
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch,
> NUTCH-961.patch, NUTCH-961.patch, NUTCH-961v2.patch,
> nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to
> strip boilerplate content from HTML pages. We should see how we can expose
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
> <name>tika.extractor</name>
> <value>none</value>
> <description>
> Which text extraction algorithm to use. Valid values are: boilerpipe or
> none.
> </description>
> </property>
>
> <property>
> <name>tika.extractor.boilerpipe.algorithm</name>
> <value>ArticleExtractor</value>
> <description>
> Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
> ArticleExtractor or CanolaExtractor.
> </description>
> </property>
> {code}