[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170556#comment-15170556
 ] 

ASF GitHub Bot commented on NUTCH-961:
--------------------------------------

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/92#discussion_r54332193
  
    --- Diff: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/BoilerpipeExtractorRepository.java ---
    @@ -0,0 +1,62 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nutch.parse.tika;
    +
    +import java.lang.ClassLoader;
    +import java.lang.InstantiationException;
    +import java.util.WeakHashMap;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.tika.parser.html.BoilerpipeContentHandler;
    +import de.l3s.boilerpipe.BoilerpipeExtractor;
    +import de.l3s.boilerpipe.extractors.*;
    +
    +class BoilerpipeExtractorRepository {
    +
    +    public static final Log LOG = LogFactory.getLog(BoilerpipeExtractorRepository.class);
    +    public static final WeakHashMap<String, BoilerpipeExtractor> extractorRepository = new WeakHashMap<String, BoilerpipeExtractor>();
    + 
    +    /**
    +     * Returns an instance of the specified extractor
    +     */
    +    public static BoilerpipeExtractor getExtractor(String boilerpipeExtractorName) {
    +      // Check if there's no instance of this extractor
    +      if (!extractorRepository.containsKey(boilerpipeExtractorName)) {
    +        // FQCN
    +        boilerpipeExtractorName = "de.l3s.boilerpipe.extractors." + boilerpipeExtractorName;
    +
    +        // Attempt to load the class
    +        try {
    +          ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
    +          Class extractorClass = loader.loadClass(boilerpipeExtractorName);
    +
    +          // Add an instance to the repository
    +          extractorRepository.put(boilerpipeExtractorName, (BoilerpipeExtractor)extractorClass.newInstance());
    +
    +        } catch (ClassNotFoundException e) {
    +          LOG.error("BoilerpipeExtractor " + boilerpipeExtractorName + " not found!");
    --- End diff --
    
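    With slf4j (rather than commons-logging, which this class imports) we can
    structure the log call in this catch better by using a parameterized
    message instead of string concatenation. See
    http://www.slf4j.org/faq.html#logging_performance
    e.g.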
    ```
    logger.debug("The entry is {}.", entry);
    ```
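
Applied to the catch block in the diff, the same idea might look roughly like the sketch below. It assumes nothing beyond the JDK, so it uses java.util.logging's `{0}` placeholder as a stand-in for slf4j's `{}`; the class name `ExtractorLoadingSketch` and the helper `loadOrNull` are hypothetical and not part of the pull request.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch, not the code from the pull request.
class ExtractorLoadingSketch {

  static final Logger LOG = Logger.getLogger(ExtractorLoadingSketch.class.getName());

  /** Tries to load a class by its fully qualified name; returns null if it is missing. */
  static Class<?> loadOrNull(String className) {
    try {
      return Class.forName(className);
    } catch (ClassNotFoundException e) {
      // Placeholder instead of string concatenation: the final message is
      // only built if a handler actually formats the record.
      // With slf4j the call would be:
      //   LOG.error("BoilerpipeExtractor {} not found!", className, e);
      LOG.log(Level.SEVERE, "BoilerpipeExtractor {0} not found!", className);
      return null;
    }
  }
}
```

Calling `ExtractorLoadingSketch.loadOrNull("de.l3s.boilerpipe.extractors.NoSuchExtractor")` returns null and logs one SEVERE record whose formatted message names the missing class. In the slf4j form shown in the comment, passing the exception as the last argument also records its stack trace.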


> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961.patch, NUTCH-961.patch, NUTCH-961v2.patch, 
> nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.
> Use the following properties to enable and control Boilerpipe.
> {code}
> <property>
>   <name>tika.extractor</name>
>   <value>none</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>  
> <property> 
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description> 
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
>   ArticleExtractor or CanolaExtractor.
>   </description>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
