[jira] [Commented] (TIKA-2683) Missing space and inappropriate new-line in Boilerpipe extracted text

ASF GitHub Bot (JIRA) Wed, 18 Jul 2018 06:55:42 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547870#comment-16547870
 ]


ASF GitHub Bot commented on TIKA-2683:
--------------------------------------

kkrugler closed pull request #243: Fix for TIKA-2683 contributed by karanjeets
URL: https://github.com/apache/tika/pull/243
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/BoilerpipeContentHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/BoilerpipeContentHandler.java
index 4d5cc46d4..191d8b8e2 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/html/BoilerpipeContentHandler.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/BoilerpipeContentHandler.java
@@ -21,7 +21,9 @@
 import java.util.BitSet;
 import java.util.List;
 import java.util.Locale;
+import java.util.Set;
 
+import com.google.common.collect.Sets;
 import de.l3s.boilerpipe.BoilerpipeExtractor;
 import de.l3s.boilerpipe.BoilerpipeProcessingException;
 import de.l3s.boilerpipe.document.TextBlock;
@@ -58,6 +60,7 @@
     private int headerCharOffset;
     private List<RecordedElement> elements;
     private TextDocument td;
+    private Set<Character> whitelistCharSet = Sets.newHashSet(' ', '\n', '\r');
     /**
      * Creates a new boilerpipe-based content extractor, using the
      * {@link DefaultExtractor} extraction rules and "delegate" as the content handler.
@@ -120,7 +123,7 @@ public void startDocument() throws SAXException {
         headerCharOffset = 0;
 
         if (includeMarkup) {
-            elements = new ArrayList<RecordedElement>();
+            elements = new ArrayList<>();
         }
     }
 
@@ -230,18 +233,24 @@ public void endDocument() throws SAXException {
                     case CONTINUE:
                         // Now emit characters that are valid. Note that boilerpipe pre-increments the character index, so
                         // we have to follow suit.
-                        for (char[] chars : element.getCharacters()) {
+                        for (int i = 0; i < element.getCharacters().size(); i++) {
+                            char[] chars = element.getCharacters().get(i);
                             curCharsIndex++;
+                            boolean isValidCharacterRun = validCharacterRuns.get(curCharsIndex);
 
-                            if (validCharacterRuns.get(curCharsIndex)) {
+                            // https://issues.apache.org/jira/projects/TIKA/issues/TIKA-2683
+                            // Allow exempted characters to be written
+                            if (isValidCharacterRun ||
+                                    (chars.length == 1 && whitelistCharSet.contains(chars[0]))) {
                                 delegate.characters(chars, 0, chars.length);
+                            }
 
-                                // https://issues.apache.org/jira/browse/TIKA-961
-                                if (!Character.isWhitespace(chars[chars.length - 1])) {
-                                    // Only add whitespace for certain elements
-                                    if (XHTMLContentHandler.ENDLINE.contains(element.getLocalName())) {
-                                        delegate.ignorableWhitespace(NL, 0, NL.length);
-                                    }
+                            // https://issues.apache.org/jira/browse/TIKA-961
+                            if (isValidCharacterRun && i == element.getCharacters().size() - 1
+                                    && !Character.isWhitespace(chars[chars.length - 1])) {
+                                // Only add whitespace for certain elements
+                                if (XHTMLContentHandler.ENDLINE.contains(element.getLocalName())) {
+                                    delegate.ignorableWhitespace(NL, 0, NL.length);
                                 }
                             }
                         }
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
index 0ad10949d..ab745f36d 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
@@ -29,7 +29,6 @@
 import javax.xml.transform.sax.TransformerHandler;
 import javax.xml.transform.stream.StreamResult;
 import java.io.ByteArrayInputStream;
-import java.io.ByteArrayOutputStream;
 import java.io.File;
 import java.io.IOException;
 import java.io.InputStream;
@@ -54,7 +53,6 @@
 import java.util.concurrent.Future;
 import java.util.regex.Pattern;
 
-import org.apache.commons.codec.binary.Base64;
 import org.apache.tika.Tika;
 import org.apache.tika.TikaTest;
 import org.apache.tika.config.ServiceLoader;
@@ -62,7 +60,6 @@
 import org.apache.tika.detect.AutoDetectReader;
 import org.apache.tika.detect.EncodingDetector;
 import org.apache.tika.exception.TikaException;
-import org.apache.tika.io.IOUtils;
 import org.apache.tika.io.TikaInputStream;
 import org.apache.tika.metadata.Geographic;
 import org.apache.tika.metadata.Metadata;
@@ -70,7 +67,6 @@
 import org.apache.tika.parser.AutoDetectParser;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.parser.Parser;
-import org.apache.tika.parser.RecursiveParserWrapper;
 import org.apache.tika.sax.AbstractRecursiveParserWrapperHandler;
 import org.apache.tika.sax.BodyContentHandler;
 import org.apache.tika.sax.LinkContentHandler;
@@ -923,6 +919,34 @@ public void testBoilerplateWhitespace() throws Exception {
         assertContains("有什么需要我帮你的", content);
     }
 
+    /**
+     * Test case for TIKA-2683
+     *
+     * @see <a href="https://issues.apache.org/jira/projects/TIKA/issues/TIKA-2683">TIKA-2683</a>
+     */
+    @Test
+    public void testBoilerplateMissingWhitespace() throws Exception {
+        String path = "/test-documents/testBoilerplateMissingSpace.html";
+
+        Metadata metadata = new Metadata();
+        BodyContentHandler handler = new BodyContentHandler();
+
+        BoilerpipeContentHandler bpHandler = new BoilerpipeContentHandler(handler);
+        bpHandler.setIncludeMarkup(true);
+
+        new HtmlParser().parse(
+                HtmlParserTest.class.getResourceAsStream(path),
+                bpHandler, metadata, new ParseContext());
+
+        String content = handler.toString();
+
+        // Should contain space between these two words as mentioned in HTML
+        assertContains("family Psychrolutidae", content);
+
+        // Shouldn't add new-line chars around brackets; This is not how the HTML look
+        assertContains("(Psychrolutes marcidus)", content);
+    }
+
     /**
      * Test case for TIKA-983:  HTML parser should add Open Graph meta tag data to Metadata returned by parser
      *
diff --git a/tika-parsers/src/test/resources/test-documents/testBoilerplateMissingSpace.html b/tika-parsers/src/test/resources/test-documents/testBoilerplateMissingSpace.html
new file mode 100644
index 000000000..06ea45832
--- /dev/null
+++ b/tika-parsers/src/test/resources/test-documents/testBoilerplateMissingSpace.html
@@ -0,0 +1,13 @@
+<!DOCTYPE html>
+<html>
+       <head>
+               <meta charset="UTF-8"/>
+               <title>Blobfish - Wikipedia</title>
+       </head>
+       <body>
+               <p>The <b>blobfish</b> (<i>Psychrolutes marcidus</i>) is a <a href="/wiki/Deep_sea_fish" title="Deep sea fish">deep sea fish</a> of the <a href="/wiki/Family_(biology)" title="Family (biology)">family</a> <a href="/wiki/Psychrolutidae" title="Psychrolutidae">Psychrolutidae</a>. It inhabits the deep waters off the coasts of mainland <a href="/wiki/Australia" title="Australia">Australia</a> and <a href="/wiki/Tasmania" title="Tasmania">Tasmania</a>, as well as the waters of <a href="/wiki/New_Zealand" title="New Zealand">New Zealand</a>.<sup id="cite_ref-fishbase_1-0" class="reference"><a href="#cite_note-fishbase-1">[1]</a></sup></p>
+               <p>Blobfish are typically shorter than 30&#160;cm (12&#160;in). They live at depths between 600 and 1,200&#160;m (2,000 and 3,900&#160;ft) where the pressure is 60 to 120 times as great as at <a href="/wiki/Sea_level" title="Sea level">sea level</a>, which would likely make <a href="/wiki/Gas_bladder" class="mw-redirect" title="Gas bladder">gas bladders</a> inefficient for maintaining <a href="/wiki/Buoyancy" title="Buoyancy">buoyancy</a>.<sup id="cite_ref-fishbase_1-1" class="reference"><a href="#cite_note-fishbase-1">[1]</a></sup> Instead, the flesh of the blobfish is primarily a gelatinous mass with a <a href="/wiki/Density" title="Density">density</a> slightly less than water; this allows the fish to float above the sea floor without expending energy on swimming. Its relative lack of muscle is not a disadvantage as it primarily swallows <a href="/wiki/Marine_snow" title="Marine snow">edible matter</a> that floats in front of it such as deep-ocean <a href="/wiki/Crustacean" title="Crustacean">crustaceans</a>.<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></p>
+               <p>Blobfish are often caught as <a href="/wiki/Bycatch" title="Bycatch">bycatch</a> in <a href="/wiki/Bottom_trawling" title="Bottom trawling">bottom trawling</a> nets.</p>
+               <p>The popular impression of the blobfish as bulbous and gelatinous is partially an artifact of the decompression damage done to specimens when they are brought to the surface from the extreme depths in which they live.<sup id="cite_ref-3" class="reference"><a href="#cite_note-3">[3]</a></sup> In their natural environment, blobfish appear more typical of their <a href="/wiki/Superclass_(biology)" class="mw-redirect" title="Superclass (biology)">superclass</a> <a href="/wiki/Osteichthyes" title="Osteichthyes">Osteichthyes</a> (bony fish).</p>
+       </body>
+</html>


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Missing space and inappropriate new-line in Boilerpipe extracted text
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2683
>                 URL: https://issues.apache.org/jira/browse/TIKA-2683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>         Environment: Replicable everywhere in all environments
>            Reporter: Karanjeet Singh
>            Priority: Major
>              Labels: Boilerplate_Removal, boilerpipe, parser
>             Fix For: 1.19
>
>
> The Boilerpipe extractor in Tika fails to capture space and new-line 
> characters in HTML.
> It also inserts additional new-line characters into the text.
> *Example URL* - [https://en.wikipedia.org/wiki/Blobfish]
> The space is missing in "family Psychrolutidae", and additional new-line 
> characters appear around round brackets '('.
>  
> Related issue reported long ago - 
> https://issues.apache.org/jira/browse/TIKA-961
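
For readers who want to see the core of the patch in isolation, the following is a minimal sketch of the emit condition used in the diff above. It is not Tika code: the class name WhitespaceRunSketch, the method emit, and the constant EXEMPT are hypothetical, and a StringBuilder stands in for the SAX delegate. It assumes, as the patch does, that a character run flagged as boilerplate should still be written when it consists of exactly one space, '\n', or '\r'. For simplicity the sketch indexes the BitSet from zero, whereas the real handler pre-increments curCharsIndex.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone illustration; not part of Tika.
class WhitespaceRunSketch {
    // Same three characters as whitelistCharSet in the patch.
    private static final Set<Character> EXEMPT =
            new HashSet<>(Arrays.asList(' ', '\n', '\r'));

    // Concatenates the runs that are either marked valid or are a single
    // exempted whitespace character.
    static String emit(List<char[]> runs, BitSet validRuns) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < runs.size(); i++) {
            char[] chars = runs.get(i);
            boolean isValidRun = validRuns.get(i);
            if (isValidRun || (chars.length == 1 && EXEMPT.contains(chars[0]))) {
                out.append(chars);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The space between the two links is its own run and is not marked valid.
        List<char[]> runs = Arrays.asList(
                "family".toCharArray(),
                " ".toCharArray(),
                "Psychrolutidae".toCharArray());
        BitSet validRuns = new BitSet();
        validRuns.set(0);
        validRuns.set(2);
        System.out.println(emit(runs, validRuns)); // prints "family Psychrolutidae"
    }
}
```

Without the exemption, the middle run would be dropped and the output would read "familyPsychrolutidae", which is the symptom described in the issue.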



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
