[
https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548220#comment-16548220
]
Hudson commented on TIKA-2683:
------------------------------
SUCCESS: Integrated in Jenkins build Tika-trunk #1521 (See
[https://builds.apache.org/job/Tika-trunk/1521/])
Fix for TIKA-2683 contributed by karanjeets (karanjeet_singh:
[https://github.com/apache/tika/commit/60bba0696868fe1bb027e2ab3e5b7bbfbd0b75cc])
* (add)
tika-parsers/src/test/resources/test-documents/testBoilerplateMissingSpace.html
* (edit)
tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
* (edit)
tika-parsers/src/main/java/org/apache/tika/parser/html/BoilerpipeContentHandler.java
Add info about TIKA-2683 fix (github:
[https://github.com/apache/tika/commit/80f6dc6e5cc10bb42590fcdf4540e59487f13fc6])
* (edit) CHANGES.txt
> Missing space and inappropriate new-line in Boilerpipe extracted text
> ---------------------------------------------------------------------
>
> Key: TIKA-2683
> URL: https://issues.apache.org/jira/browse/TIKA-2683
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.18
> Environment: Replicable everywhere in all environments
> Reporter: Karanjeet Singh
> Assignee: Ken Krugler
> Priority: Major
> Labels: Boilerplate_Removal, boilerpipe, parser
> Fix For: 1.19
>
>
> The Boilerpipe extractor in Tika fails to capture some space and new-line
> characters in HTML.
> It also inserts additional new-line characters in the middle of the text.
> *Example URL* - [https://en.wikipedia.org/wiki/Blobfish]
> The space is missing in "family Psychrolutidae", and additional new-line
> characters appear around round brackets '('
>
> A related issue was reported a long time ago:
> https://issues.apache.org/jira/browse/TIKA-961