[ https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548220#comment-16548220 ]

Hudson commented on TIKA-2683:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1521 (See 
[https://builds.apache.org/job/Tika-trunk/1521/])
Fix for TIKA-2683 contributed by karanjeets (karanjeet_singh: 
[https://github.com/apache/tika/commit/60bba0696868fe1bb027e2ab3e5b7bbfbd0b75cc])
* (add) tika-parsers/src/test/resources/test-documents/testBoilerplateMissingSpace.html
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/BoilerpipeContentHandler.java
Add info about TIKA-2683 fix (github: 
[https://github.com/apache/tika/commit/80f6dc6e5cc10bb42590fcdf4540e59487f13fc6])
* (edit) CHANGES.txt


> Missing space and inappropriate new-line in Boilerpipe extracted text
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2683
>                 URL: https://issues.apache.org/jira/browse/TIKA-2683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>         Environment: Replicable everywhere in all environments
>            Reporter: Karanjeet Singh
>            Assignee: Ken Krugler
>            Priority: Major
>              Labels: Boilerplate_Removal, boilerpipe, parser
>             Fix For: 1.19
>
>
> The Boilerpipe extractor in Tika fails to capture space and new-line 
> characters in HTML.
> It also inserts additional new-line characters in the middle of the text.
> *Example URL* - [https://en.wikipedia.org/wiki/Blobfish]
> The space in "family Psychrolutidae" is missing, and additional new-line 
> characters appear around round brackets '(' 
>  
> A related issue was reported long ago: 
> https://issues.apache.org/jira/browse/TIKA-961



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
